API: numeric inference in Series constructor #40489

Closed
jbrockmendel opened this issue Mar 17, 2021 · 6 comments · Fixed by #42870
Labels: API - Consistency, Bug, Index, Series

@jbrockmendel
Member

Index.__new__ does inference on numeric data more aggressively than Series.__new__. It would be nice if these behaviors matched. (xref #40451, which is also about differences between Series and Index inference, though in that case Series is the more aggressive one.)

import numpy as np
import pandas as pd

data = np.array([np.nan, np.nan, 2.0], dtype=object)

>>> pd.Series(data).dtype
dtype('O')

>>> pd.Index(data).dtype
dtype('float64')

>>> pd.array(data).dtype
Float64Dtype()

(If we passed data[:-1] to pd.array, we'd get back a PandasArray[object], because it passes skipna=True to lib.infer_dtype.)
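To look at the skipna effect in isolation, here is a minimal sketch. It assumes the public pandas.api.types.infer_dtype function behaves like the internal lib.infer_dtype referred to above; the variable names are illustrative only, and no output is shown because the exact labels may differ between pandas versions.

```python
import numpy as np
from pandas.api.types import infer_dtype

data = np.array([np.nan, np.nan, 2.0], dtype=object)

# With skipna=True the NaNs are ignored, so the remaining 2.0
# decides what kind of data this is.
kind_full = infer_dtype(data, skipna=True)

# After dropping the last element only NaNs remain; once they are
# skipped there is no value left to infer a numeric kind from.
kind_all_nan = infer_dtype(data[:-1], skipna=True)

print(kind_full, kind_all_nan)
```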

Changing the Series behavior to match Index breaks 207 tests (140 of which are for the str accessor; I expect some others are false positives), so this would be a non-trivial API change.

jbrockmendel added the Bug and Needs Triage labels on Mar 17, 2021
@rhshadrach
Member

Is the current Index behavior more desirable than Series? Just considering the example, I would expect the Series result.

rhshadrach added the Index, Series, and API - Consistency labels and removed the Needs Triage label on Mar 19, 2021
@jbrockmendel
Member Author

I find the Index behavior more useful, but would be OK either way. Mostly I want them to be consistent.

@jorisvandenbossche
Member

You are specifically mentioning inferring numeric object dtype. But there is a reason to infer object dtype in general, for scalars that otherwise have no numpy equivalent. I would also find it a bit strange to infer object dtype depending on its content (i.e. if it turns out to be numeric, leave it as object dtype)?

@jbrockmendel
Member Author

> You are specifically mentioning inferring numeric object dtype.

The relevant cases where the behaviors differ are numeric and datetimelike; the latter is covered in #40451.

@jbrockmendel
Member Author

After experimenting with both possible changes (making Series behavior match Index or making Index behavior match Series), I'm now leaning towards preferring the Series behavior, i.e. inferring less aggressively.
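If the constructors infer less aggressively, callers who want the numeric dtype have to ask for it. A minimal sketch of one way to do that, assuming Series.infer_objects (a method not mentioned in this thread) is an acceptable opt-in; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

data = np.array([np.nan, np.nan, 2.0], dtype=object)

# Construct first, then explicitly request soft conversion of the
# object-dtype values to a more specific dtype.
ser = pd.Series(data)
inferred = ser.infer_objects()

print(ser.dtype, inferred.dtype)
```

The dtype of `ser` itself depends on the constructor behavior discussed here, so only the result of the explicit `infer_objects()` call should be relied on.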

In the branch that makes Series infer more, I've still got 49 test failures. In the other branch I'm down to 3 (though with a ton of warnings to catch). In each case, some unrelated bugs surfaced that I'll try to address separately.

A couple of sticking points with the make-Index-less-aggressive option: 1) In the CategoricalDtype constructor we call Index, and I think we do want the more aggressive casting there. 2) ensure_index casting is slightly different from Index; ATM I've kept it aggressive, but having non-matching behaviors isn't great.

@jreback
Contributor

jreback commented Jul 9, 2021

Yeah, I think this is basically left over from having Index try to infer datetimelike strings to be an actual DTI. We don't ever want to do this implicitly anymore. So I would respect a passed-in dtype (which I think we already do), respect a dtyped array passed in as well, and NOT infer. So I would be +1 on deprecating the Index/pd.array inference paths here and going with the Series behavior.
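The "respect a passed-in dtype" point can be checked with a small sketch. It assumes only the public Series and Index constructors with an explicit dtype argument; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

data = np.array([np.nan, np.nan, 2.0], dtype=object)

# An explicit dtype removes the need for inference, so both
# constructors should agree with each other.
ser_float = pd.Series(data, dtype="float64")
idx_float = pd.Index(data, dtype="float64")

ser_obj = pd.Series(data, dtype=object)
idx_obj = pd.Index(data, dtype=object)

print(ser_float.dtype, idx_float.dtype)
print(ser_obj.dtype, idx_obj.dtype)
```

Passing the dtype explicitly is a way to get matching results from both constructors regardless of which inference behavior a given pandas version implements.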
