Index Constructors inferring output from data #17246

Open
TomAugspurger opened this issue Aug 14, 2017 · 9 comments
Labels
Constructors, Dtype Conversions, Enhancement, MultiIndex

Comments

@TomAugspurger
Contributor

TomAugspurger commented Aug 14, 2017

Two proposals:

Consolidate all inference to the Index constructor

Passing dtype=object disables inference

Index(..., dtype=object) disables all inference. So Index([1, 2], dtype=object) will give you an Index instead of an Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of a MultiIndex, etc.
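As a rough illustration of how the two proposals fit together, here is a toy model in plain Python. It is not pandas code: `choose_index_kind` and the string labels it returns are hypothetical names used only for this sketch, and the inference rules are simplified versions of the examples in this thread.

```python
from datetime import datetime


def choose_index_kind(data, dtype=None):
    """Toy model of the proposed rule (hypothetical helper, not pandas code).

    Returns the name of the index class a generic constructor would pick.
    """
    values = list(data)
    # Proposal 2: an explicit dtype=object switches all inference off.
    if dtype is object:
        return "Index"
    # Proposal 1: every inference rule lives in this one place.
    if isinstance(data, range):
        return "RangeIndex"
    if values and all(isinstance(v, tuple) for v in values):
        return "MultiIndex"
    if values and all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "Int64Index"
    if values and all(isinstance(v, datetime) for v in values):
        return "DatetimeIndex"
    return "Index"


print(choose_index_kind([1, 2]))                              # Int64Index
print(choose_index_kind([1, 2], dtype=object))                # Index
print(choose_index_kind([(1, "a"), (2, "b")]))                # MultiIndex
print(choose_index_kind([(1, "a"), (2, "b")], dtype=object))  # Index
```

The point of the sketch is that the dtype check comes before any inspection of the data, so no inference rule can override an explicit dtype=object.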

(original post follows)


Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient but sometimes makes the behavior difficult to reason about. E.g. hash_tuples currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.

Do we want to make our Index constructors more predictable? For reference, here are some examples:

>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
           labels=[[0, 1], [0, 1]])

>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')

# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')

# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)

# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)

# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')

# 6.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')

Of these, I think the first (Index -> MultiIndex if you have tuples) and the last (MultiIndex -> Index if your tuples are all length 1) are undesirable. The Index -> MultiIndex one has the tupleize_cols keyword to control this behavior. In #17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that [1, 2, 3] magically returning an Int64Index is ok, but [(1, 2), (3, 4)] returning a MI isn't (maybe the difference between a MI and an Index is larger than the difference between an Int64Index and an Index?). I believe that for either RangeIndex or IntervalIndex, someone (@shoyer?) had objections to overloading the Index constructor to return the specialized type.

So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.

cc @jreback, @jorisvandenbossche, @shoyer

@shoyer
Member

shoyer commented Aug 14, 2017

I like having the generic pandas.Index() constructor do type inference. This is convenient and usually helpful. (Though if we were starting over from scratch, I might encourage creating a separate constructor function, e.g., pd.as_index(), to do type inference.)

I'm not a big fan of separate keyword arguments like tupleize_cols and squeeze. Instead, I would suggest:

  • Setting dtype on pd.Index controls the type of index created, e.g., dtype=object guarantees a base Index object.
  • I think we should simply remove outright the MultiIndex -> Index behavior with length one tuples. I understand the intention of avoiding multi-indexes with one level, but the inconsistency is grating. Every time I've encountered this, I've either used another work-around to create a MultiIndex or reverted to the base Index constructor.

@TomAugspurger
Contributor Author

I think we should simply remove outright the MultiIndex -> Index

OK, since both you and @jreback are in favor of that over a keyword, I'll amend #17236 to just remove that behavior (without a deprecation cycle I suppose?).

And just to be clear @shoyer, you're in favor of keeping Index([('a', 'b'), ('c', 'd')]) returning a MI, since that's consistent with Index doing inference?

@shoyer
Member

shoyer commented Aug 14, 2017

OK, since both you and @jreback are in favor of that over a keyword, I'll amend #17236 to just remove that behavior (without a deprecation cycle I suppose?).

Sounds good to me. This should be mentioned as a breaking change, of course. I can't think of any real use cases for this behavior, but I'm sure this will still come up for someone somehow!

And just to be clear @shoyer, you're in favor of keeping Index([('a', 'b'), ('c', 'd')]) returning a MI, since that's consistent with Index doing inference?

Yes, that seems consistent to me. I would suggest pd.Index([('a', 'b'), ('c', 'd')], dtype=object) as a good way to spell creating an Index of tuples.

@jorisvandenbossche
Member

See #17236 (comment). I am also +1 on making the MultiIndex constructors consistently return a MultiIndex (so remove the MultiIndex -> Index path).

@TomAugspurger
Contributor Author

So it seems like the consensus is to put all the inference into Index and remove it from others (specifically MultiIndex of all length-1 tuples will be a MultiIndex instead of an Index).

The second idea is to have Index(..., dtype=object) disable all inference. So Index([1, 2], dtype=object) will give you an Index instead of Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of MultiIndex, etc.

I'll update the top post.
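For reference, the first point of the summary can be checked with a short script. This is a sketch that assumes a pandas release in which the length-1 special case has already been removed from the MultiIndex constructors; on older releases the first result would be a flat Index instead. Only public pandas calls already mentioned in this thread are used.

```python
import pandas as pd

# Length-1 tuples: the MultiIndex constructor keeps a one-level MultiIndex
# (assumes the squeeze-to-Index special case is gone in this pandas version).
mi = pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
print(type(mi).__name__, mi.nlevels)

# Tuples passed to the generic constructor are still inferred as a MultiIndex...
inferred = pd.Index([(1, 2), (3, 4)])
print(type(inferred).__name__)

# ...unless tupleize_cols=False asks for a flat index of tuple objects.
flat = pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
print(type(flat).__name__, flat.dtype)
```

Whether dtype=object alone is enough to get the flat index for tuples depends on the pandas version, so the sketch uses tupleize_cols=False for that case.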

@jreback
Contributor

jreback commented Aug 22, 2017

this already does what you expect

In [2]: pd.Index([0, 1], dtype=object)
Out[2]: Index([0, 1], dtype='object')

but eliminating the special case is important as well.

@jorisvandenbossche
Member

@TomAugspurger yep, agree with your summary

Using dtype=object to disable inference indeed already works for most data, but not yet for tuples (they still become a MultiIndex). So it would be good to fix that.

@jreback
Contributor

jreback commented Aug 22, 2017

ahh I would be ok with fixing this. This is not respecting the dtype.

In [1]: pd.Index([(0,), (1,)], dtype=object)
Out[1]: Int64Index([0, 1], dtype='int64')

@jorisvandenbossche
Member

Ah, good catch, that is yet another one to fix! Because the one I meant was

In [23]: pd.Index([(0,2), (1,3)], dtype=object)
Out[23]: 
MultiIndex(levels=[[0, 1], [2, 3]],
           labels=[[0, 1], [0, 1]])

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
6 participants