Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Index and Array constructors design #23212

Closed
jorisvandenbossche opened this issue Oct 17, 2018 · 23 comments
Closed

API: Index and Array constructors design #23212

jorisvandenbossche opened this issue Oct 17, 2018 · 23 comments
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Constructors Series/DataFrame/Index/pd.array Constructors

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 17, 2018

To split off the discussion on the constructors from #23185, to have a more focussed discussion about that here. Also going further on the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594

So topic of this issue: how should the different constructors look like for the internal EAs and the Index classes based on those EAs (specifically for the datetimelike ones).

Index constructors

I think the for the Index constructors, there is not that much discussion.
We have:

  • default Index(..) (__new__ or __init__): this is quite overloaded for some of the index classes, but that's the way it is now since they are exposed to the user.

  • _simple_new: I think we agree that for those (from Tom's comment here REF: Simplify Period/Datetime Array/Index constructors #23093 (review)), it should basically simply get the EA array and potentially a name:

    @classmethod
    def _simple_new(cls, values, name=None):
        # type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
        result = object.__new__(cls)
        result._data = values
        result.name = name
        result._reset_identity()
        return result
    
  • _shallow_copy and _shallow_copy_with_infer might need another look to propose something.

Array constructors

The default Index constructors mix a lot of different things (which is what partly lead to the suite of other constructors), and I personally don't think this is something we necessarily need to repeat for the Array constructors.

In the discussion related to

Each Array type might have it specific constructors (like we have IntervalArray.from_breaks and others), but I think that in the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594, there are 3 clearly defined use case that are generic for the different dtypes. Constructing from:

  1. physical values (ordinals + freq for Period, datetime64 + optional tz for Datetime, int ndarray + mask for IntegerArray)
  2. extension Array (i.e. accept itself)
  3. array of scalars (eg an object ndarray of Period or Timestamp objects)

For this last item, we already _from_sequence for exactly this as part of the EA interface.

So one option is simply accept all of those three things in the main Array __init__, another option is to have separate constructors for them. I think this is what the discussion is mainly about?

I see the following advantages of keeping them separate (or at least keep the third item separate):

  • Code clarity throughout the Array implementation: To quote Tom from (REF: Simplify Datetimelike constructor dispatching #23140 (comment)):

    From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.

  • Keep concerns separated inside the constructor -> code clarity in the constructors itself
  • We already decided that we will not rely on the default constructor in the EA interface but rather have a specific _from_sequence, _from_factorized, so we cannot use it anyway in places that need to deal with EAs in general

Also note that this is basically what we have for the new IntegerArray. It's __init__ only accepts a ndarray of integers + mask, and there is a separate function integer_array that provides a more general purpose constructor (from list, from floats, detecting NaNs as missing values, etc ..), which is then used in _from_sequence.

cc @TomAugspurger @jreback @jbrockmendel

@TomAugspurger
Copy link
Contributor

My opening statement below:

In a lot of places, the caller has some information about the type of data that's being passed to create a new PeriodArray. Consider DatetimeIndex.to_period. It knows that data is an ndarray[dateime64[ns]]. I think it's wasteful to do a bunch of dtype and isinstance checks in a forgiving PeriodArray.__init__, when you already know you have valid data of a certain form. So in my PR I add a PeriodArray._from_datetime64 constructor for cases like this, that just calls dt64arr_to_periodarr before passing that to _simple_new.

On having a flexible __init__, I don't have too strong of opinions. @jbrockmendel mentioned a desire to have PeriodArray(list(array)) work, a kind of round-tripping. On my PeriodArray PR, I kind of appreciate that that raises right now, because it highlights places where we're doing an operations that requires a copy / dtype conversion, when we may be able to do it without any copies. But I don't know if that means we should keep it that way in the end. Right now my preferred heuristic is "Can we do this without copying or inference? If so, we can do it in the init". So PeriodArray[PeriodIndex] or PeriodArray[Series[Period]] are both fine. But PeriodArray[ndarray[object]] isn't (IMO).

@jorisvandenbossche
Copy link
Member Author

More concrete, on the PeriodArray PR, there is currently:

  • __init__: can take one of:
    • PeriodArray (instance of itself, or a container from which that can be extracted)
    • ordinals + freq
  • _complex_new: the hodge podge that PeriodIndex can do
  • _simple_new: same as __init__ ? (not fully sure) -> so can remove?
  • _from_periods: from object ndarray (or list) of Periods
  • _from_ordinals: same as __init__ -> worth keeping just for the explicit name, or just use __init__ everywhere for this case? (if we keep it, I think we should forbid EA in this path)
  • _from_datetime64: specific path for datetime64 array
  • _from_sequence mapping to _from_periods and _from_factorized mapping to __init__ (both for the EA interface)

@TomAugspurger I think this can be simplified a bit: I would move _complex_new out of the class, and make this a helper function somewhere (could maybe also use some refactoring/clean-up in a follow-up).
Further, I think we can reduce __init__, _simple_new, and _from_ordinals to simply the __int__ (unless we want to keep _from_ordinals for it's more explicit name).
Then, the question is also if we need _from_periods since we already have _from_sequence for this, a method we need anyway for the EA interface.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 17, 2018

Further, I think we can reduce ...

Just did this locally. Will push once tests all finish. Right now I have

  • _from_sequence(cls, scalars: Sequence[Period], dtype: PeriodDtype, copy=False)
  • _from_factorized(cls, values: ndarray[int], PeriodArray)
  • _from_datetime64(cls, values: ndarray[datetime64[ns]], freq: Freq, tz: Optional[tzinfo])

The PeriodArray takes

  • ndarary[int], the physical storage values, and a freq. For datetime / timedelta array this would just be the ndarray of datetime64ns / timedelta64ns
  • Series[Period] and Index[Period]. We just unbox these.

everything else raises. Things look quite nice from a code clarity POV, and from a future code-reuse with DatetimeArray / TimedeltaArray.

Specifically, I think removing _from_ordinals is OK if we're able to say that __init__ handles the physical storage without an ambiguity. If we can't say that, (say because PeriodArray.__init__ takes an ndarray of integers like [2000, 2001] and a freq), then we would need to add _from_ordinals back for clarity).

@jorisvandenbossche
Copy link
Member Author

That sounds good. Only, is the _from_factorized a typo? As I thought the factorization values were the ordinals, not Period objects.

@TomAugspurger
Copy link
Contributor

Yes, a typo, fixed. I wish we could really run MyPy on these types :)

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

private from methods for construction seem fine as long as they are well defined and limited

i DOnt want to see public from_ methods

@jorisvandenbossche
Copy link
Member Author

i DOnt want to see public from_ methods

And do you see any in the discussion above? (I don't think the discussion is about this)
That said, IntervalAray has from_breaks, from_array and from_tuples, is that a problem? Or how would you like to see this differently?

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

no that’s the point - I thought u were adding public methods

@jorisvandenbossche
Copy link
Member Author

This discussion is about the different internal constructors (and also __init__, since that overlaps with that)

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

@jorisvandenbossche I understand. However things get conflated and we end up talking about everything.

For PeriodArray itself I think you can just adopt the current structure, meaning __init__ is relatively lmited function (IOW it can only accept a PeriodArray itself), and we have a period_array function that can do inference (e.g. accept a list-like of ndarrays). I think this would satisfy the public interface.

The physical values (ordinals) should be by private from constructor. I think arrays of scalars can be converted by period_array. I have an aversion to having a separate construction from a list-like of scalars.

@TomAugspurger
Copy link
Contributor

IOW it can only accept a PeriodArray itself

I'm finding it somewhat convenient, and not costly, to also accept Series[period] and PeriodIndex as well.

Aside from that, it seems like we're in agreement.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

yep that seems ok to me.

just want these to be consistent as possible across the EA, the model we have now is integer_array (though period_array is doing more)

@TomAugspurger
Copy link
Contributor

Are wee good with closing this? Should we wait till DatetimeArray is done?

@jbrockmendel
Copy link
Member

A change I've implemented in a WIP DatetimeArray branch that I'd like to get OKed here: putting dtype in all of the DatetimeArray/TimedeltaArray/PeriodArray __init__ methods.

This makes code-sharing much easier, we type(self)(i8values, freq=self.freq, dtype=self.dtype) works in all cases, in particular without a need to special-case DatetimeTZDtype.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 2, 2018 via email

@jbrockmendel
Copy link
Member

Should PeriodArray be refactored to remove freq

I don't think so; the others still take both freq and dtype.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 2, 2018 via email

@jbrockmendel
Copy link
Member

Great. I'll make a PR for this edit after #23433 goes through.

@jorisvandenbossche
Copy link
Member Author

If we want to keep the constructors simple, and not have a bunch of redundant / overlapping keywords, we could also have a specific constructor for this path that can be used in the shared functionality?

@jbrockmendel
Copy link
Member

I've been asked to reiterate my thoughts here.

Series(...) is very forgiving/smart about what it accepts. So is DataFrame. So is Index and DatetimeIndex and PeriodIndex and ... (less so for MultiIndex and I'm honestly not sure about IntervalIndex, but the pattern/policy is clear). If pandas-provided EAs are first-class pandas classes, they should behave in the same user-friendly way as everything else.

I see this as the default, with deviations from it needing justification, not the other way around.

In #23493 Tom wrote:

I thought we were all on board with the goal of the DatetimelikeArray.__init__ being no inference and no copy.

Is it safe to assume that this near-consensus applies to TimedeltaArray and PeriodArray as well? How about other future pandas-provided EAs? For the moment I'm going to assume that it applies to DTA/TDA/PA.

... why would this be the goal? You want a constructor that is no-inference and no-copy? Congratulations, you've just invented _simple_new. You want _from_sequence to be the all-purpose any-sequence constructor? Everywhere else we call that __init__/__new__. This policy makes private the most useful thing and public the least useful thing. It makes zero sense to me.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 9, 2018 via email

@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@jbrockmendel
Copy link
Member

Is there anything left to do/discuss here?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label May 18, 2021
@mroeschke
Copy link
Member

Seems like this issue has evolved and encompassed a lot of topics while sufficiently stalling. I think it's best if separate issues were opened if specific followups need tackling. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Constructors Series/DataFrame/Index/pd.array Constructors
Projects
None yet
Development

No branches or pull requests

5 participants