API: Index and Array constructors design #23212

jorisvandenbossche · 2018-10-17T19:15:55Z

To split off the discussion on the constructors from #23185, to have a more focussed discussion about that here. Also going further on the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594

So topic of this issue: how should the different constructors look like for the internal EAs and the Index classes based on those EAs (specifically for the datetimelike ones).

Index constructors

I think the for the Index constructors, there is not that much discussion.
We have:

default Index(..) (__new__ or __init__): this is quite overloaded for some of the index classes, but that's the way it is now since they are exposed to the user.

_simple_new: I think we agree that for those (from Tom's comment here REF: Simplify Period/Datetime Array/Index constructors #23093 (review)), it should basically simply get the EA array and potentially a name:

@classmethod
def _simple_new(cls, values, name=None):
    # type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
    result = object.__new__(cls)
    result._data = values
    result.name = name
    result._reset_identity()
    return result

_shallow_copy and _shallow_copy_with_infer might need another look to propose something.

Array constructors

The default Index constructors mix a lot of different things (which is what partly lead to the suite of other constructors), and I personally don't think this is something we necessarily need to repeat for the Array constructors.

In the discussion related to

Each Array type might have it specific constructors (like we have IntervalArray.from_breaks and others), but I think that in the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594, there are 3 clearly defined use case that are generic for the different dtypes. Constructing from:

physical values (ordinals + freq for Period, datetime64 + optional tz for Datetime, int ndarray + mask for IntegerArray)
extension Array (i.e. accept itself)
array of scalars (eg an object ndarray of Period or Timestamp objects)

For this last item, we already _from_sequence for exactly this as part of the EA interface.

So one option is simply accept all of those three things in the main Array __init__, another option is to have separate constructors for them. I think this is what the discussion is mainly about?

I see the following advantages of keeping them separate (or at least keep the third item separate):

Code clarity throughout the Array implementation: To quote Tom from (REF: Simplify Datetimelike constructor dispatching #23140 (comment)):

From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.
Keep concerns separated inside the constructor -> code clarity in the constructors itself
We already decided that we will not rely on the default constructor in the EA interface but rather have a specific _from_sequence, _from_factorized, so we cannot use it anyway in places that need to deal with EAs in general

Also note that this is basically what we have for the new IntegerArray. It's __init__ only accepts a ndarray of integers + mask, and there is a separate function integer_array that provides a more general purpose constructor (from list, from floats, detecting NaNs as missing values, etc ..), which is then used in _from_sequence.

cc @TomAugspurger @jreback @jbrockmendel

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-10-17T19:25:17Z

My opening statement below:

In a lot of places, the caller has some information about the type of data that's being passed to create a new PeriodArray. Consider DatetimeIndex.to_period. It knows that data is an ndarray[dateime64[ns]]. I think it's wasteful to do a bunch of dtype and isinstance checks in a forgiving PeriodArray.__init__, when you already know you have valid data of a certain form. So in my PR I add a PeriodArray._from_datetime64 constructor for cases like this, that just calls dt64arr_to_periodarr before passing that to _simple_new.

On having a flexible __init__, I don't have too strong of opinions. @jbrockmendel mentioned a desire to have PeriodArray(list(array)) work, a kind of round-tripping. On my PeriodArray PR, I kind of appreciate that that raises right now, because it highlights places where we're doing an operations that requires a copy / dtype conversion, when we may be able to do it without any copies. But I don't know if that means we should keep it that way in the end. Right now my preferred heuristic is "Can we do this without copying or inference? If so, we can do it in the init". So PeriodArray[PeriodIndex] or PeriodArray[Series[Period]] are both fine. But PeriodArray[ndarray[object]] isn't (IMO).

jorisvandenbossche · 2018-10-17T19:46:45Z

More concrete, on the PeriodArray PR, there is currently:

__init__: can take one of:
- PeriodArray (instance of itself, or a container from which that can be extracted)
- ordinals + freq
_complex_new: the hodge podge that PeriodIndex can do
_simple_new: same as __init__ ? (not fully sure) -> so can remove?
_from_periods: from object ndarray (or list) of Periods
_from_ordinals: same as __init__ -> worth keeping just for the explicit name, or just use __init__ everywhere for this case? (if we keep it, I think we should forbid EA in this path)
_from_datetime64: specific path for datetime64 array
_from_sequence mapping to _from_periods and _from_factorized mapping to __init__ (both for the EA interface)

@TomAugspurger I think this can be simplified a bit: I would move _complex_new out of the class, and make this a helper function somewhere (could maybe also use some refactoring/clean-up in a follow-up).
Further, I think we can reduce __init__, _simple_new, and _from_ordinals to simply the __int__ (unless we want to keep _from_ordinals for it's more explicit name).
Then, the question is also if we need _from_periods since we already have _from_sequence for this, a method we need anyway for the EA interface.

TomAugspurger · 2018-10-17T19:53:43Z

Further, I think we can reduce ...

Just did this locally. Will push once tests all finish. Right now I have

_from_sequence(cls, scalars: Sequence[Period], dtype: PeriodDtype, copy=False)
_from_factorized(cls, values: ndarray[int], PeriodArray)
_from_datetime64(cls, values: ndarray[datetime64[ns]], freq: Freq, tz: Optional[tzinfo])

The PeriodArray takes

ndarary[int], the physical storage values, and a freq. For datetime / timedelta array this would just be the ndarray of datetime64ns / timedelta64ns
Series[Period] and Index[Period]. We just unbox these.

everything else raises. Things look quite nice from a code clarity POV, and from a future code-reuse with DatetimeArray / TimedeltaArray.

Specifically, I think removing _from_ordinals is OK if we're able to say that __init__ handles the physical storage without an ambiguity. If we can't say that, (say because PeriodArray.__init__ takes an ndarray of integers like [2000, 2001] and a freq), then we would need to add _from_ordinals back for clarity).

jorisvandenbossche · 2018-10-17T19:57:32Z

That sounds good. Only, is the _from_factorized a typo? As I thought the factorization values were the ordinals, not Period objects.

TomAugspurger · 2018-10-17T20:01:26Z

Yes, a typo, fixed. I wish we could really run MyPy on these types :)

jreback · 2018-10-17T20:36:55Z

private from methods for construction seem fine as long as they are well defined and limited

i DOnt want to see public from_ methods

jorisvandenbossche · 2018-10-17T20:46:26Z

i DOnt want to see public from_ methods

And do you see any in the discussion above? (I don't think the discussion is about this)
That said, IntervalAray has from_breaks, from_array and from_tuples, is that a problem? Or how would you like to see this differently?

jreback · 2018-10-17T20:57:51Z

no that’s the point - I thought u were adding public methods

jorisvandenbossche · 2018-10-17T21:03:16Z

This discussion is about the different internal constructors (and also __init__, since that overlaps with that)

jreback · 2018-10-17T22:34:30Z

@jorisvandenbossche I understand. However things get conflated and we end up talking about everything.

For PeriodArray itself I think you can just adopt the current structure, meaning __init__ is relatively lmited function (IOW it can only accept a PeriodArray itself), and we have a period_array function that can do inference (e.g. accept a list-like of ndarrays). I think this would satisfy the public interface.

The physical values (ordinals) should be by private from constructor. I think arrays of scalars can be converted by period_array. I have an aversion to having a separate construction from a list-like of scalars.

TomAugspurger · 2018-10-17T22:38:19Z

IOW it can only accept a PeriodArray itself

I'm finding it somewhat convenient, and not costly, to also accept Series[period] and PeriodIndex as well.

Aside from that, it seems like we're in agreement.

jreback · 2018-10-17T22:46:08Z

yep that seems ok to me.

just want these to be consistent as possible across the EA, the model we have now is integer_array (though period_array is doing more)

TomAugspurger · 2018-11-02T17:27:53Z

Are wee good with closing this? Should we wait till DatetimeArray is done?

jbrockmendel · 2018-11-02T21:22:09Z

A change I've implemented in a WIP DatetimeArray branch that I'd like to get OKed here: putting dtype in all of the DatetimeArray/TimedeltaArray/PeriodArray __init__ methods.

This makes code-sharing much easier, we type(self)(i8values, freq=self.freq, dtype=self.dtype) works in all cases, in particular without a need to special-case DatetimeTZDtype.

TomAugspurger · 2018-11-02T21:42:33Z

That's fine. Should PeriodArray be refactored to remove freq, and instead take a dtype? Then we'd have a consistent - data (ndarray) - dtype (PeriodDtype, DatetimeTZDtype, datetime64[ns] - copy

…

On Fri, Nov 2, 2018 at 2:23 PM jbrockmendel ***@***.***> wrote: A change I've implemented in a WIP DatetimeArray branch that I'd like to get OKed here: putting dtype in all of the DatetimeArray/TimedeltaArray/PeriodArray __init__ methods. This makes code-sharing much easier, we type(self)(i8values, freq=self.freq, dtype=self.dtype) works in all cases, in particular without a need to special-case DatetimeTZDtype. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#23212 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIlr11rYJ0fEG1Va82wjnbKCEiSCCks5urLe1gaJpZM4XlBHB> .

jbrockmendel · 2018-11-02T22:47:13Z

Should PeriodArray be refactored to remove freq

I don't think so; the others still take both freq and dtype.

TomAugspurger · 2018-11-02T22:54:14Z

Mmm fair point.

…

________________________________ From: jbrockmendel <notifications@github.com> Sent: Friday, November 2, 2018 3:47:17 PM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] API: Index and Array constructors design (#23212) Should PeriodArray be refactored to remove freq I don't think so; the others still take both freq and dtype. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#23212 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIm3KzEsI067w4UYYQkW-0bWZXoX-ks5urMt1gaJpZM4XlBHB>.

jbrockmendel · 2018-11-02T23:28:38Z

Great. I'll make a PR for this edit after #23433 goes through.

jorisvandenbossche · 2018-11-05T14:14:33Z

If we want to keep the constructors simple, and not have a bunch of redundant / overlapping keywords, we could also have a specific constructor for this path that can be used in the shared functionality?

jbrockmendel · 2018-11-08T23:36:36Z

I've been asked to reiterate my thoughts here.

Series(...) is very forgiving/smart about what it accepts. So is DataFrame. So is Index and DatetimeIndex and PeriodIndex and ... (less so for MultiIndex and I'm honestly not sure about IntervalIndex, but the pattern/policy is clear). If pandas-provided EAs are first-class pandas classes, they should behave in the same user-friendly way as everything else.

I see this as the default, with deviations from it needing justification, not the other way around.

In #23493 Tom wrote:

I thought we were all on board with the goal of the DatetimelikeArray.__init__ being no inference and no copy.

Is it safe to assume that this near-consensus applies to TimedeltaArray and PeriodArray as well? How about other future pandas-provided EAs? For the moment I'm going to assume that it applies to DTA/TDA/PA.

... why would this be the goal? You want a constructor that is no-inference and no-copy? Congratulations, you've just invented _simple_new. You want _from_sequence to be the all-purpose any-sequence constructor? Everywhere else we call that __init__/__new__. This policy makes private the most useful thing and public the least useful thing. It makes zero sense to me.

TomAugspurger · 2018-11-09T03:07:00Z

I see this as the default, with deviations from it needing justification,

not the other way around. I (and several other maintainers) appreciate the flexibility of series, but have grown to dislike the edge cases, difficult to understand code, and performance costs associated with it. We'll still need the complex code in whatever function is going to the complex work of taking user-provided input and wrangling it into an array, but we don't need to pay that cost every time we look create an array.

This policy makes private the most useful thing and public the least

useful thing. How so? The user-friendly API will still be available via. `pd.array()`. While things like TimedeltaArray.__init__ are of course public, they won't be recommend for use. FYI: IntervalIndex only accepts a sequence of Intervals. It doesn't attempt any kind of fancy / forgiving initialization (e.g. from a list of tuples).

…

On Thu, Nov 8, 2018 at 5:36 PM jbrockmendel ***@***.***> wrote: I've been asked to reiterate my thoughts here. Series(...) is very forgiving/smart about what it accepts. So is DataFrame. So is Index and DatetimeIndex and PeriodIndex and ... (less so for MultiIndex and I'm honestly not sure about IntervalIndex, but the pattern/policy is clear). If pandas-provided EAs are first-class pandas classes, they should behave in the same user-friendly way as everything else. I see this as the default, with deviations from it needing justification, not the other way around. In #23493 <#23493> Tom wrote: I thought we were all on board with the goal of the DatetimelikeArray.__init__ being no inference and no copy. Is it safe to assume that this near-consensus applies to TimedeltaArray and PeriodArray as well? How about other future pandas-provided EAs? For the moment I'm going to assume that it applies to DTA/TDA/PA. ... why would this be the goal? You want a constructor that is no-inference and no-copy? Congratulations, you've just invented _simple_new. You want _from_sequence to be the all-purpose any-sequence constructor? Everywhere else we call that __init__/__new__. This policy makes private the most useful thing and public the least useful thing. It makes zero sense to me. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#23212 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInkwCnx3TXCg1WAvpRD5dY_FD-FQks5utMALgaJpZM4XlBHB> .

jbrockmendel · 2021-05-18T20:47:30Z

Is there anything left to do/discuss here?

mroeschke · 2022-07-05T23:26:18Z

Seems like this issue has evolved and encompassed a lot of topics while sufficiently stalling. I think it's best if separate issues were opened if specific followups need tackling. Closing.

jorisvandenbossche mentioned this issue Oct 19, 2018

REF: Make PeriodArray an ExtensionArray #22862

Merged

jbrockmendel added the API Design label Oct 21, 2018

This was referenced Nov 5, 2018

REF/ENH: Constructors for DatetimeArray/TimedeltaArray #23493

Closed

BUG/REF: TimedeltaIndex.__new__ #23539

Merged

jorisvandenbossche mentioned this issue Jan 8, 2019

EA: revert treatment of i8values #24623

Closed

1 task

jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label May 18, 2021

mroeschke closed this as completed Jul 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Index and Array constructors design #23212

API: Index and Array constructors design #23212

jorisvandenbossche commented Oct 17, 2018 •

edited

Loading

TomAugspurger commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018 •

edited

Loading

jorisvandenbossche commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018

jreback commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

jreback commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

jreback commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018

jreback commented Oct 17, 2018

TomAugspurger commented Nov 2, 2018

jbrockmendel commented Nov 2, 2018

TomAugspurger commented Nov 2, 2018 via email

jbrockmendel commented Nov 2, 2018

TomAugspurger commented Nov 2, 2018 via email

jbrockmendel commented Nov 2, 2018

jorisvandenbossche commented Nov 5, 2018

jbrockmendel commented Nov 8, 2018

TomAugspurger commented Nov 9, 2018 via email

jbrockmendel commented May 18, 2021

mroeschke commented Jul 5, 2022

API: Index and Array constructors design #23212

API: Index and Array constructors design #23212

Comments

jorisvandenbossche commented Oct 17, 2018 • edited Loading

Index constructors

Array constructors

TomAugspurger commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018 • edited Loading

jorisvandenbossche commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018

jreback commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

jreback commented Oct 17, 2018

jorisvandenbossche commented Oct 17, 2018

jreback commented Oct 17, 2018

TomAugspurger commented Oct 17, 2018

jreback commented Oct 17, 2018

TomAugspurger commented Nov 2, 2018

jbrockmendel commented Nov 2, 2018

TomAugspurger commented Nov 2, 2018 via email

jbrockmendel commented Nov 2, 2018

TomAugspurger commented Nov 2, 2018 via email

jbrockmendel commented Nov 2, 2018

jorisvandenbossche commented Nov 5, 2018

jbrockmendel commented Nov 8, 2018

TomAugspurger commented Nov 9, 2018 via email

jbrockmendel commented May 18, 2021

mroeschke commented Jul 5, 2022

jorisvandenbossche commented Oct 17, 2018 •

edited

Loading

TomAugspurger commented Oct 17, 2018 •

edited

Loading