REF: Simplify Period/Datetime Array/Index constructors #23093

Merged
merged 6 commits into from
Oct 12, 2018

jbrockmendel
Member

Split off from the same branch that spawned #23083.

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

if freq is None and any(x is None for x in [periods, start, end]):
    raise ValueError('Must provide freq argument if no data is '
                     'supplied')

Member Author

Moved these checks here from inside __new__

Contributor

+1, as long as we're OK with ignoring gibberish arguments when they're not actually used (which seems to be what we do right now).

In [9]: idx = pd.PeriodIndex(['2000', '2001'], freq='D')

In [10]: pd.PeriodIndex(idx, start='foo')
Out[10]: PeriodIndex(['2000-01-01', '2001-01-01'], dtype='period[D]', freq='D')
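The check quoted above and the concern about silently ignored arguments can be illustrated with a small standalone sketch. The function name `validate_range_args`, the `data` parameter, and the `strict` flag are hypothetical and not part of pandas; the first branch mirrors the condition in the diff, and the strict branch shows one possible way to reject `start`/`end`/`periods` when data is already supplied.

```python
def validate_range_args(data=None, freq=None, periods=None, start=None,
                        end=None, strict=False):
    """Hypothetical validator, not pandas API."""
    range_args = {"periods": periods, "start": start, "end": end}

    if data is None:
        # Same condition as the check quoted in the diff.
        if freq is None and any(x is None for x in range_args.values()):
            raise ValueError("Must provide freq argument if no data is "
                             "supplied")
        return

    if strict:
        # Assumption: a stricter constructor could refuse arguments that
        # would otherwise be ignored because data was passed.
        unused = sorted(k for k, v in range_args.items() if v is not None)
        if unused:
            raise TypeError("arguments ignored when data is supplied: "
                            + ", ".join(unused))


# Lenient mode accepts the gibberish argument, as in the session above.
validate_range_args(data=["2000", "2001"], freq="D", start="foo")
```

With `strict=True` the same call raises `TypeError`, which is roughly the behavior one might want for new array classes.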


  @classmethod
- def _from_ordinals(cls, values, freq=None):
+ def _from_ordinals(cls, values, freq=None, **kwargs):
Member Author

Ideally I'd like to get rid of _from_ordinals and have _simple_new be the lowest-level constructor like with all the other classes. This turns out to be infeasible ATM because a bunch of parent class methods operate on .values instead of ._ndarray_values and end up passing object-dtype to _shallow_copy. This is being split off of a branch that is trying to avoid that.

periods = dtl.validate_periods(periods)
return cls._generate_range(start, end, periods, freq,
closed=closed)

Member Author

This is still present in the TimedeltaIndex constructor, but it's not too late to get it out of the TimedeltaArray constructor.


- return cls._from_ordinals(values, freq)
+ return cls._from_ordinals(values, freq=freq, **kwargs)
Member

Was it causing bugs that kwargs was not passed?

Member Author

This is necessary in order to allow PeriodIndex to inherit _simple_new from PeriodArrayMixin.

The PeriodIndex constructors in particular (but also DatetimeIndex and TimedeltaIndex) are really messy and complicated. Inheriting methods and deleting as much as possible makes it easier for me to simplify them, even if some of that will end up getting copy/pasted down the road if/when inheritance is replaced.

Member

Can you then move the implementation and share it by calling the other class instead of inheriting?

Member Author

Do you mean moving the _from_ordinals implementation? We need both because the Index version sets self.name and calls _reset_identity

Member

I meant PeriodIndex._simple_new calling PeriodArrayMixin._simple_new, similar to what you did in one of the previous PRs (I didn't look at the code, so I don't know if that is possible in this case).

Member Author

I think I get what you're saying, will give it a shot

Contributor

Did anything come of this, one way or the other?

Member Author

Yes. In an earlier commit I deleted PeriodIndex._simple_new and retained PeriodIndex._from_ordinals. Now that has been reversed, and it's non-trivially nicer.
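The delegation pattern discussed in this thread (the Index-level `_simple_new` calling the array-level one, then setting the name and resetting identity) might look roughly like the sketch below. The classes `ToyPeriodArrayMixin` and `ToyPeriodIndex` are hypothetical stand-ins, not the pandas classes, and the attribute names `_data`, `_freq`, and `_id` are assumptions chosen for illustration.

```python
class ToyPeriodArrayMixin:
    """Hypothetical stand-in for an array-level mixin."""

    @classmethod
    def _simple_new(cls, values, freq=None, **kwargs):
        # Lowest-level constructor: no validation, just attach the data.
        result = object.__new__(cls)
        result._data = values
        result._freq = freq
        return result


class ToyPeriodIndex(ToyPeriodArrayMixin):
    """Hypothetical stand-in for an Index subclass."""

    @classmethod
    def _simple_new(cls, values, name=None, freq=None, **kwargs):
        # Reuse the array-level constructor, then add the Index-only steps.
        result = super()._simple_new(values, freq=freq, **kwargs)
        result.name = name
        result._reset_identity()
        return result

    def _reset_identity(self):
        # Assumption: identity is tracked with a fresh sentinel object.
        self._id = object()


idx = ToyPeriodIndex._simple_new([10957, 10958], name="when", freq="D")
```

Only the name handling and identity reset live on the Index side here, so the shared part stays in one place even if the inheritance is later replaced by composition.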

"""
Values should be int ordinals
`__new__` & `_simple_new` coerce to ordinals and call this method
"""
# **kwargs are included so that the signature matches PeriodIndex,
# letting us share _simple_new
Member

Why do you want to share this? Once we actually don't inherit it anymore, we will need to add it back to PeriodIndex

Contributor

@TomAugspurger Oct 12, 2018

I think the signature should match across index classes, and hopefully the implementation can be shared.

Member

But this one is about _from_ordinals, which will not be shared across index classes as it is Period-specific

Member Author

I think we're moving towards getting rid of _from_ordinals in order to make PeriodIndex work like all the others. To do that, we need to simplify _simple_new, which means fixing the places where inappropriate values get passed to it, i.e. #23095.

raise TypeError("PeriodIndex can't take floats")
return cls(values, name=name, freq=freq, **kwargs)

return cls._from_ordinals(values, name, freq, **kwargs)
Member

If keeping this now, I think it will be clearer what the changes are once we actually split index/array

Contributor

I don't think it matters too much either way. @jbrockmendel, does reverting this change cause issues with master? Or does the "old" PeriodIndex._simple_new do the right thing, at the cost of code duplication? My PeriodArray PR is going to change this again anyway.

Member Author

It has changed since Joris's comment above, but this is just a matter of code duplication.

> My PeriodArray PR is going to change this again anyway

I'm kind of hoping I can simplify these a good deal further before you finish up with SparseEA.

                  freq=freq)
          if tz is not None:
              result = result.tz_localize('UTC').tz_convert(tz)
          return result
      return f
  elif klass == PeriodIndex:
      def f(values, freq=None, tz=None):
-         return PeriodIndex._simple_new(values, None, freq=freq)
+         return PeriodIndex._simple_new(values, name=None, freq=freq)
Member Author

Really these shouldn't be calling _simple_new at all.

Contributor

agreed.

Contributor

Well, this is actually tricky. The period codes are what actually get stored. We in fact need to think about this for EA, potentially providing an easy way to have serialization code deal with this.

Why isn't this from_ordinals?
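To make the point about stored codes concrete, here is a minimal sketch of the round trip a serializer would need: integer ordinals plus a frequency go in, period-like values come out. It assumes daily frequency only, with ordinals counted as days since 1970-01-01; the helper names `to_ordinals` and `from_ordinals` are hypothetical and use plain `datetime.date` objects rather than pandas Periods.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_ordinals(dates):
    """Encode dates as integer day counts since the epoch."""
    return [(d - EPOCH).days for d in dates]


def from_ordinals(ordinals, freq="D"):
    """Rebuild dates from stored integer codes plus a frequency."""
    if freq != "D":
        # Only the daily case is sketched here.
        raise NotImplementedError("sketch only handles freq='D'")
    return [EPOCH + timedelta(days=o) for o in ordinals]


stored = to_ordinals([date(2000, 1, 1), date(2001, 1, 1)])
restored = from_ordinals(stored, freq="D")
```

The integers alone are ambiguous without the frequency, which is why the reader side has to be handed both.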

Member Author

Turns out changing this to just call PeriodIndex(...) causes some test failures. Tried it in #23140.

@jorisvandenbossche
Member

Also, to what extent is this not conflicting with the work @TomAugspurger is doing in the PeriodArray PR?

@TomAugspurger
Contributor

TomAugspurger commented Oct 11, 2018 via email

@jorisvandenbossche
Member

But the question is then maybe: do we want to solve that before or after that PR (or in it)?

@codecov

codecov bot commented Oct 11, 2018

Codecov Report

Merging #23093 into master will increase coverage by 0.01%.
The diff coverage is 93.54%.

@@            Coverage Diff             @@
##           master   #23093      +/-   ##
==========================================
+ Coverage    92.2%   92.21%   +0.01%     
==========================================
  Files         169      169              
  Lines       50924    50920       -4     
==========================================
+ Hits        46952    46955       +3     
+ Misses       3972     3965       -7
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.63% <90.32%> (+0.01%) ⬆️ |
| #single | 42.3% <58.06%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/datetimes.py | 95.72% <ø> (-0.02%) ⬇️ |
| pandas/core/arrays/period.py | 94.29% <100%> (+1.43%) ⬆️ |
| pandas/core/indexes/period.py | 93.43% <100%> (-0.02%) ⬇️ |
| pandas/io/pytables.py | 92.44% <100%> (ø) ⬆️ |
| pandas/core/arrays/datetimes.py | 96.94% <100%> (+0.02%) ⬆️ |
| pandas/core/arrays/timedeltas.py | 93.96% <66.66%> (+1.43%) ⬆️ |
| pandas/core/arrays/datetimelike.py | 95.41% <90%> (-0.15%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c8ce3d0...04c75ca.

Contributor

@TomAugspurger left a comment

Before we go too far down this path, can we come up with what we would like out
of our Index constructors?

Some _simple_news accept and use a dtype. Others, like CategoricalIndex,
let you override the ordered or categories. I'd like to get away from that if
possible. I'd rather put the responsibility of getting the underlying data in
the right shape on the caller.

Ideally, I would like _simple_new to have the signature
(Union[ndarray, ExtensionArray], name).
All _attributes that we previously carried around on the instance would go on
the Array. This would make _simple_new truly simple:

@classmethod
def _simple_new(cls, values, name=None):
    # type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
    result = object.__new__(cls)
    result._data = values
    result.name = name
    result._reset_identity()
    return result

I haven't investigated whether this is feasible (it's certainly blocked by
DatetimeArray, since we need somewhere to put the .tz).

Closely related, what do we want out of _shallow_copy and
_shallow_copy_with_infer? In what ways do they differ from a
return self._simple_new(self.values, name=self.name)? Do they need to accept a
values argument?
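One way to picture the proposal, with attributes living on the array and `_shallow_copy` reduced to a thin wrapper around `_simple_new`, is the toy sketch below. `ToyArray` and `ToyIndex` are hypothetical classes, not pandas API, and whether `_shallow_copy` should accept a `values` argument at all is exactly the open question above; it is included here only to show what that option would look like.

```python
class ToyArray:
    """Hypothetical array that owns the data and its metadata."""

    def __init__(self, data, freq=None):
        self.data = list(data)
        self.freq = freq


class ToyIndex:
    """Hypothetical index holding only an array and a name."""

    @classmethod
    def _simple_new(cls, values, name=None):
        result = object.__new__(cls)
        result._data = values
        result.name = name
        return result

    @property
    def freq(self):
        # Metadata is read from the array, not stored on the index.
        return self._data.freq

    def _shallow_copy(self, values=None, name=None):
        # The caller is responsible for passing an already-valid array.
        if values is None:
            values = self._data
        if name is None:
            name = self.name
        return type(self)._simple_new(values, name=name)


idx = ToyIndex._simple_new(ToyArray([1, 2], freq="D"), name="a")
copy = idx._shallow_copy(name="b")
```

Because `freq` is looked up on the array, the copy keeps it without `_simple_new` needing any extra keyword arguments.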

@jorisvandenbossche
Member

Fully agree on the proposed _simple_new interface.

> Closely related, what do we want out of _shallow_copy and
> _shallow_copy_with_infer? In what ways do they differ from a
> return self._simple_new(self.values, name=self.name)? Do they need to accept a
> values argument?

We have been discussing those in #22961 as well.

@jbrockmendel
Member Author

> All _attributes that we previously carried around on the instance would go on the Array. This would make _simple_new truly simple

I like this idea.

> pd.PeriodIndex(idx, start='foo')

Given that we have an opportunity to start fresh with the EA subclasses, I think we should avoid letting this happen there. In the Index subclasses it would be nice to fix, but can be considered separately.

> Closely related, what do we want out of _shallow_copy and
> _shallow_copy_with_infer?

Part of my takeaway from #23095 is that moving away from both of these would be helpful medium-term.

@jbrockmendel
Member Author

I'm hopeful that we're close to a consensus here and in #23095 (and tangentially related #23031). If not, is there a non-controversial subset of this that can be broken off? These are blockers.

Contributor

@TomAugspurger left a comment

Only question is on PeriodIndex._simple_new, but it doesn't matter either way. I'm fine with whatever gets us to our goal with the least amount of energy.

@jreback
Contributor

jreback commented Oct 12, 2018

Note that there is a bug out there about deserialization of HDF5 Period types. This may solve it.

@jreback added the Period (Period data type) and Clean labels Oct 12, 2018
@jreback added this to the 0.24.0 milestone Oct 12, 2018
@jbrockmendel
Member Author

> Why isn't this from_ordinals?

No idea what was originally intended; I want to go the other way with it and use a public constructor (i.e. __new__).

@jreback
Contributor

jreback commented Oct 12, 2018

OK, this looks fine.

@jreback merged commit e4b67ca into pandas-dev:master Oct 12, 2018
@jbrockmendel deleted the pi_cons branch October 12, 2018 22:15
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Labels
Clean, Period (Period data type)
5 participants