Index Constructors inferring output from data #17246

TomAugspurger · 2017-08-14T11:03:04Z

Two proposals:

Consolidate all inference to the `Index` constructor

Retain Index(...) inferring the best container for the data passed
Remove MultiIndex(data) returning an Index when data is a list of length-1 tuples (xref API: Have MultiIndex consturctors always return a MI #17236)

Passing `dtype=object` disables inference

Index(..., dtype=object) disables all inference. So Index([1, 2], dtype=object) will give you an Index instead of Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of MultiIndex, etc.
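
To make the two proposals concrete, here is a minimal sketch in plain Python. It is only a toy model of the proposed dispatch rule, not pandas' implementation: the function name `choose_index_kind` and the string return values are made up for illustration, and only a few of the inferred types are modelled.

```python
def choose_index_kind(data, dtype=None):
    """Toy model of the proposed rule (hypothetical helper, not pandas code).

    Returns the *name* of the index class the proposal would pick.
    """
    # Proposal 2: an explicit dtype=object switches off all inference.
    if dtype is object:
        return "Index"

    # Proposal 1: otherwise, all inference lives in this one place.
    if isinstance(data, range):
        return "RangeIndex"

    values = list(data)
    if values and all(isinstance(v, tuple) for v in values):
        return "MultiIndex"
    if values and all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    ):
        return "Int64Index"

    # Anything else falls back to the base object Index.
    return "Index"


print(choose_index_kind([1, 2]))
print(choose_index_kind([1, 2], dtype=object))
print(choose_index_kind([(1, "a"), (2, "b")]))
print(choose_index_kind([(1, "a"), (2, "b")], dtype=object))
```

The point of the sketch is the ordering: the `dtype=object` check comes first, so no later branch can override an explicit request for a base Index.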

(original post follows)

Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient, but sometimes makes the behavior difficult to reason about. For example, hash_tuples currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.

Do we want to make our Index constructors more predictable? For reference, here are some examples:

>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
           labels=[[0, 1], [0, 1]])

>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')

# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')

# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)

# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)

# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')

# 6.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')

Of these, I think the first (Index -> MultiIndex if you have tuples) and the last (MultiIndex -> Index if your tuples are all length 1) are undesirable. The Index -> MultiIndex one has the tupleize_cols keyword to control this behavior. In #17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that [1, 2, 3] magically returning an Int64Index is ok, but [(1, 2), (3, 4)] returning a MI isn't (maybe the difference between a MI and Index is larger than the difference between an Int64Index and Index?). I believe that for either the RangeIndex or the IntervalIndex case someone (@shoyer?) had objections to overloading the Index constructor to return the specialized type.

So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.

cc @jreback, @jorisvandenbossche, @shoyer


shoyer · 2017-08-14T15:49:19Z

I like how the generic pandas.Index() constructor does type inference. This is convenient and usually helpful. (Though if we were starting over from scratch, I might encourage creating a separate constructor function, e.g., pd.as_index() to do type inference.)

I'm not a big fan of separate keyword arguments like tupleize_cols and squeeze. Instead, I would suggest:

Setting dtype on pd.Index controls the type of index created, e.g., dtype=object guarantees a base Index object.
I think we should simply remove outright the MultiIndex -> Index behavior with length one tuples. I understand the intention of avoiding multi-indexes with one level, but the inconsistency is grating. Every time I've encountered this, I've either used another work-around to create a MultiIndex or reverted to the base Index constructor.
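
As a rough illustration of why dropping the special case helps, here is a small sketch in plain Python. The helper `factorize_tuples` is hypothetical and is not how pandas builds a MultiIndex; it assumes a non-empty list of equal-length tuples with sortable values. It always returns one level per tuple position, so code that loops over levels needs no separate branch for length-one tuples.

```python
def factorize_tuples(tuples):
    """Split tuples into per-position levels and integer codes.

    Hypothetical helper for illustration only. It never "squeezes":
    length-1 tuples still produce exactly one level.
    """
    tuples = list(tuples)
    if not tuples:
        raise ValueError("need at least one tuple")
    width = len(tuples[0])
    if any(len(t) != width for t in tuples):
        raise ValueError("all tuples must have the same length")

    levels, codes = [], []
    for pos in range(width):
        column = [t[pos] for t in tuples]
        uniques = sorted(set(column))
        lookup = {value: i for i, value in enumerate(uniques)}
        levels.append(uniques)
        codes.append([lookup[value] for value in column])
    return levels, codes


# Two-element tuples: two levels.
print(factorize_tuples([(1, 2), (3, 4)]))
# One-element tuples: still a list of levels, just of length one.
print(factorize_tuples([(1,), (2,), (3,)]))
```

Because the return shape is the same in both cases, a caller in the style of hash_tuples could iterate over `levels` without checking what kind of object it got back.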

TomAugspurger · 2017-08-14T15:55:24Z

> I think we should simply remove outright the MultiIndex -> Index

OK, since both you and @jreback are in favor of that over a keyword, I'll amend #17236 to just remove that behavior (without a deprecation cycle I suppose?).

And just to be clear @shoyer, you're in favor of keeping Index([('a', 'b'), ('c', 'd')]) returning a MI, since that's consistent with Index doing inference?

shoyer · 2017-08-14T16:03:14Z

> OK, since both you and @jreback are in favor of that over a keyword, I'll amend #17236 to just remove that behavior (without a deprecation cycle I suppose?).

Sounds good to me. This should be mentioned as a breaking change, of course. I can't think of any real use cases for this behavior, but I'm sure this will still come up for someone somehow!

> And just to be clear @shoyer, you're in favor of keeping Index([('a', 'b'), ('c', 'd')]) returning a MI, since that's consistent with Index doing inference?

Yes, that seems consistent to me. I would suggest pd.Index([('a', 'b'), ('c', 'd')], dtype=object) as a good way to spell creating an Index of tuples.

jorisvandenbossche · 2017-08-16T10:21:46Z

See #17236 (comment), I am also +1 on making the MultiIndex constructors consistently return MultiIndex (so remove the MultiIndex -> Index way)

TomAugspurger · 2017-08-21T15:31:41Z

So it seems like the consensus is to put all the inference into Index and remove it from others (specifically MultiIndex of all length-1 tuples will be a MultiIndex instead of an Index).

The second idea is to have Index(..., dtype=object) disable all inference. So Index([1, 2], dtype=object) will give you an Index instead of Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of MultiIndex, etc.

I'll update the top post.

jreback · 2017-08-22T11:22:45Z

this already does what you expect

In [2]: pd.Index([0, 1], dtype=object)
Out[2]: Index([0, 1], dtype='object')

but eliminating the special case is important as well.

jorisvandenbossche · 2017-08-22T11:32:51Z

@TomAugspurger yep, agree with your summary

Using dtype=object to disable inference indeed already works for most data, but not yet for tuples (they still become a MultiIndex). So it would be good to fix that.

jreback · 2017-08-22T12:00:23Z

ahh I would be ok with fixing this. This is not respecting the dtype.

In [1]: pd.Index([(0,), (1,)], dtype=object)
Out[1]: Int64Index([0, 1], dtype='int64')

jorisvandenbossche · 2017-08-22T12:06:55Z

Ah, good catch, that is yet another one to fix! Because the one I meant was

In [23]: pd.Index([(0,2), (1,3)], dtype=object)
Out[23]: 
MultiIndex(levels=[[0, 1], [2, 3]],
           labels=[[0, 1], [0, 1]])

TomAugspurger added the API Design, Dtype Conversions, and MultiIndex labels Aug 14, 2017

TomAugspurger mentioned this issue Aug 14, 2017

API: Have MultiIndex consturctors always return a MI #17236

Merged

TomAugspurger mentioned this issue Aug 21, 2017

BUG: Thoroughly dedup columns in read_csv #17060

Merged

TomAugspurger added Difficulty Intermediate labels Aug 22, 2017

jorisvandenbossche added this to the 0.21.0 milestone Aug 22, 2017

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

jorisvandenbossche mentioned this issue Oct 17, 2017

DEPR: Deprecate tupleize_cols in Index constructor #17899

Closed

charlie0389 mentioned this issue Mar 30, 2018

BUG: #19497 FIX. Add tupleize_cols option to internals._transform_index() #20526

Closed

TomAugspurger mentioned this issue Mar 30, 2018

Bug: rename incapable of accepting tuples as new name #19497

Closed

toobaz mentioned this issue Jun 29, 2019

BUG: pandas.Index takes multidimensional array as input #20285

Open

jschendel mentioned this issue Jun 29, 2019

BUG: Index constructor should not allow an ndarray with ndim > 2 #27125

Closed

jbrockmendel added the Constructors label Jul 23, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

This was referenced Jan 1, 2020

REF: separate casting out of Index.__new__ #30586

Merged

BUG: Index.__new__ with Interval/Period data and object dtype #30635

Merged

mroeschke removed the API Design label Jun 12, 2021

mroeschke added the Enhancement label Jun 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel mentioned this issue Oct 28, 2022

API: Index vs Series constructor alignment #49372

Closed
