-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Simplify Period/Datetime Array/Index constructors #23093
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
|
if freq is None and any(x is None for x in [periods, start, end]): | ||
raise ValueError('Must provide freq argument if no data is ' | ||
'supplied') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved these checks here from inside __new__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, as long as we're OK with ignoring gibberish arguments when there not actually used (which seems to be what we do right now).
In [9]: idx = pd.PeriodIndex(['2000', '2001'], freq='D')
In [10]: pd.PeriodIndex(idx, start='foo')
Out[10]: PeriodIndex(['2000-01-01', '2001-01-01'], dtype='period[D]', freq='D')
|
||
@classmethod | ||
def _from_ordinals(cls, values, freq=None): | ||
def _from_ordinals(cls, values, freq=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I'd like to get rid of _from_ordinals
and have _simple_new
be the lowest-level constructor like with all the other classes. This turns out to be infeasible ATM because a bunch of parent class methods operate on .values
instead of ._ndarray_values
and end up passing object-dtype to _shallow_copy
. This is being split off of a branch that is trying to avoid that.
periods = dtl.validate_periods(periods) | ||
return cls._generate_range(start, end, periods, freq, | ||
closed=closed) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still present in the TimedeltaIndex constructor, but its not too late to get it out of the TimedeltaArray constructor.
|
||
return cls._from_ordinals(values, freq) | ||
return cls._from_ordinals(values, freq=freq, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this causing bugs that kwargs was not passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary in order to allow PeriodIndex
to inherit _simple_new
from PeriodArrayMixin
.
The PeriodIndex constructors in particular (but also DatetimeIndex and TimedeltaIndex) are really messy and complicated. Inheriting methods and deleting as much as possible makes it easier for me to simplify them, even if some of that will end up getting copy/pasted down the road if/when inheritance is replaced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you then move the implementation and share it by calling the other class instead of inheriting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean moving the _from_ordinals
implementation? We need both because the Index version sets self.name and calls _reset_identity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant in PeriodIndex._simple_new calling PeriodArrayMixin._simple_new, similar as you did in one of the previous PRs (didn't look at the code, so don't know if that is possible in this case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I get what you're saying, will give it a shot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did anything come of this, one way or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In an earlier commit I deleted PeriodIndex._simple_new and retained PeriodIndex._from_ordinals. Now that has been reversed, and its non-trivially nicer.
""" | ||
Values should be int ordinals | ||
`__new__` & `_simple_new` cooerce to ordinals and call this method | ||
""" | ||
# **kwargs are included so that the signature matches PeriodIndex, | ||
# letting us share _simple_new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to share this? Once we actually don't inherit it anymore, we will need to add it back to PeriodIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the signature should match across index classes, and hopefully the implementation can be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this one is about _from_ordinals
, which will not be shared across index classes as it is Period-specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're moving towards getting rid of _from_ordinals
in order to make PeriodIndex work like all the others. To do that, we need to simplify _simple_new, which means fixing the places where inappropriate values
get passed to it, i.e. #23095.
raise TypeError("PeriodIndex can't take floats") | ||
return cls(values, name=name, freq=freq, **kwargs) | ||
|
||
return cls._from_ordinals(values, name, freq, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If keeping this now, I think it will be clearer what the changes are once we actually split index/array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters too much either way. @jbrockmendel does reverting this change cause issues with master? Or does the "old" PeriodIndex._simple_new
do the right thing, at the cost of code duplication? My PeriodArray PR is going to change this again anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has changed since joris's comment above, but this is just a matter of code duplication.
My PeriodArray PR is going to change this again anyway
I'm kind of hoping I can simplify these a good deal further before you finish up with SparseEA
freq=freq) | ||
if tz is not None: | ||
result = result.tz_localize('UTC').tz_convert(tz) | ||
return result | ||
return f | ||
elif klass == PeriodIndex: | ||
def f(values, freq=None, tz=None): | ||
return PeriodIndex._simple_new(values, None, freq=freq) | ||
return PeriodIndex._simple_new(values, name=None, freq=freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really these shouldn't be calling _simple_new at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, this is actually tricky. the period codes are actually stored. We in fact need to think about this for EA, potentially providing an easy way to have serialization code deal with this.
Why isn't this from_ordinals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out changing this to just call PeriodIndex(...) causes some test failures. Tried it in #23140.
Also, to what extent is not conflicting with the work @TomAugspurger is doing in the PeriodArray PR? |
I haven't looked through this yet, but the constructors are a major pain
point on that PR right now.
…On Thu, Oct 11, 2018 at 1:00 PM Joris Van den Bossche < ***@***.***> wrote:
Also, to what extent is not conflicting with the work @TomAugspurger
<https://github.com/TomAugspurger> is doing in the PeriodArray PR?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23093 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvUHbNjhgrix0AH8hvpdLlyX6KWQks5uj4dTgaJpZM4XXx8L>
.
|
But the question is then maybe: do we want to solve that before or after that PR (or in) ? |
Codecov Report
@@ Coverage Diff @@
## master #23093 +/- ##
==========================================
+ Coverage 92.2% 92.21% +0.01%
==========================================
Files 169 169
Lines 50924 50920 -4
==========================================
+ Hits 46952 46955 +3
+ Misses 3972 3965 -7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we go too far down this path, can we come up with what we would like out
of our Index constructors?
Some _simple_new
s accept and use a dtype
. Others, like CategoricalIndex,
let you override the ordered or categories. I'd like to get away from that if
possible. I'd rather put the responsibility of getting the underlying data in
the right shape on the caller.
Ideally, I would like _simple_new
to have the signature
(Union[ndarray, ExtensionArray], name)
.
All _attributes
that we previously carried around on the instance would go on
the Array. This would make _simple_new
truly simple:
@classmethod
def _simple_new(cls, values, name=None):
# type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
result = object.__new__(cls)
result._data = values
result.name = name
result._reset_identity()
return result
I haven't investigated whether this is feasible (it's certainly blocked by
DatetimeArray, since we need somewhere to put the .tz
).
Closely related, what do we want out of _shallow_copy
and
_shallow_copy_with_infer
? In what ways do they differ from a
return self._simple_new(self.values, name=self.name)
? Do they need to accept a
values
argument?
if freq is None and any(x is None for x in [periods, start, end]): | ||
raise ValueError('Must provide freq argument if no data is ' | ||
'supplied') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, as long as we're OK with ignoring gibberish arguments when there not actually used (which seems to be what we do right now).
In [9]: idx = pd.PeriodIndex(['2000', '2001'], freq='D')
In [10]: pd.PeriodIndex(idx, start='foo')
Out[10]: PeriodIndex(['2000-01-01', '2001-01-01'], dtype='period[D]', freq='D')
Fully agree on the proposed
We have been discussing those in #22961 as well |
I like this idea.
Given that we have an opportunity to start fresh with the EA subclasses, I think we should avoid letting this happen there. In the Index subclasses it would be nice to fix, but can be considered separately.
Part of my takeaway from #23095 is that moving away from both of these would be helpful medium-term. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only question is on PeriodIndex._simple_new, but it doesn't matter either way. I'm fine with whatever gets us to our goal with the least amount of energy.
raise TypeError("PeriodIndex can't take floats") | ||
return cls(values, name=name, freq=freq, **kwargs) | ||
|
||
return cls._from_ordinals(values, name, freq, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters too much either way. @jbrockmendel does reverting this change cause issues with master? Or does the "old" PeriodIndex._simple_new
do the right thing, at the cost of code duplication? My PeriodArray PR is going to change this again anyway
freq=freq) | ||
if tz is not None: | ||
result = result.tz_localize('UTC').tz_convert(tz) | ||
return result | ||
return f | ||
elif klass == PeriodIndex: | ||
def f(values, freq=None, tz=None): | ||
return PeriodIndex._simple_new(values, None, freq=freq) | ||
return PeriodIndex._simple_new(values, name=None, freq=freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed.
freq=freq) | ||
if tz is not None: | ||
result = result.tz_localize('UTC').tz_convert(tz) | ||
return result | ||
return f | ||
elif klass == PeriodIndex: | ||
def f(values, freq=None, tz=None): | ||
return PeriodIndex._simple_new(values, None, freq=freq) | ||
return PeriodIndex._simple_new(values, name=None, freq=freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, this is actually tricky. the period codes are actually stored. We in fact need to think about this for EA, potentially providing an easy way to have serialization code deal with this.
Why isn't this from_ordinals?
note there is a bug out there about deserialization of HDF5 Period types. This may solve it |
No idea what was originally intended; I want to go the other way with it and use a public constructor (i.e. |
ok this looks fine. |
Split off from the same branch that spawned #23083.