
BUG/TST: Fix infer_dtype for Period array-likes and general ExtensionArrays #37367

Merged: 15 commits, Feb 12, 2021

Conversation

jorisvandenbossche (Member):

Closes #23553

In addition, I also changed infer_dtype to fall back to inferring from the scalar elements if the array-like is not recognized directly (currently it raises an error, which doesn't seem very useful).

@jorisvandenbossche jorisvandenbossche added the Dtype Conversions Unexpected or buggy dtype conversions label Oct 23, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Oct 23, 2020

# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
Member:

do we have tests that get here?

Member Author:

The base extension test and the decimal test that I added in this PR get here (those were raising errors before)

Member Author:

Note I am not fully sure about this change. Before, it raised an error; now it will typically return "mixed" for random python objects. Neither seems very useful (or in other words, both are an indication that nothing could be inferred). But I found it a bit inconsistent to raise in this case, whereas if you passed a list of the same objects, we would actually infer a type.

Contributor:

We are already doing this below. This does a full conversion of the object; I think this will simply kill performance in some cases, making this routine very unpredictable. I would rather not make this change here.

Contributor:

Random python objects will be marked as 'mixed' in any event, without the performance penalty below.

if val in _TYPE_MAP:
return _TYPE_MAP[val]
# also check base name for parametrized dtypes (eg period[D])
if isinstance(val, str):
val = val.split("[")[0]
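The check above can be illustrated with a minimal plain-Python sketch (this mirrors the excerpt rather than the actual lib.pyx code, and the tiny _TYPE_MAP subset here is only illustrative):

```python
# Illustrative subset of the mapping; the real _TYPE_MAP is much larger
_TYPE_MAP = {"period": "period", "interval": "interval"}

def lookup(val):
    # direct hit first
    if val in _TYPE_MAP:
        return _TYPE_MAP[val]
    # also check the base name for parametrized dtype strings (e.g. "period[D]")
    if isinstance(val, str):
        val = val.split("[")[0]
        if val in _TYPE_MAP:
            return _TYPE_MAP[val]
    return None

print(lookup("period[D]"))   # period
print(lookup("float64"))     # None (not in this illustrative subset)
```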
Contributor:

Can we explicitly add the Period[D] types to the _TYPE_MAP? (there aren't that many)

Member Author:

Going from the PeriodDtypeCode enum, there are actually quite a few.

Member:

If we were to get that specific, at some point it makes more sense to just return a dtype object.

Contributor:

Right here, why can't we try to convert to an EA type using registry.find()?

jorisvandenbossche (Member Author) commented Nov 25, 2020:

Because we already have a dtype here, there is no need to look anything up in the registry. We don't want to infer a type; we have an EA dtype that we want to find in the _TYPE_MAP, to infer the category returned by infer_dtype by checking the name, kind, or base attributes of the dtype.
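That attribute-based lookup can be sketched roughly as follows (a hypothetical simplification in plain Python, not the actual Cython code; the small _TYPE_MAP subset is illustrative):

```python
import numpy as np

# Illustrative subset; the real _TYPE_MAP maps many more names and kind codes
_TYPE_MAP = {"f": "floating", "i": "integer", "interval": "interval"}

def try_infer_map(dtype):
    # check a few attributes of the dtype object against the map
    for attr in ("name", "kind", "base"):
        val = getattr(dtype, attr, None)
        try:
            if val in _TYPE_MAP:
                return _TYPE_MAP[val]
        except TypeError:
            # some attribute values may not be hashable
            pass
    return None

print(try_infer_map(np.dtype("float64")))  # floating (matched on kind "f")
```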

Member Author:

Do we want to import those in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)

If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

Contributor:

Do we not already import all of the scalar EA types? Why is this any different?

+1 on using the existing machinery

Member:

> I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

100% agree on both points.

> Do we want to import those in lib.pyx?

Not ideal, but I don't think it will harm anything ATM.

Contributor:

> and please don't say that I should convert the string to a dtype, as you can see in the code a few lines above, we actually start from a dtype object

Exactly: you are missing the entire point of the dtype abstraction. You avoid parsing strings in the first place.

I will be blocking this until/unless a good solution is done.

Member Author:

> do we not already import all of the scalar EA types?

lib.pyx doesn't know anything about EAs. It only imports helper functions like is_period_object.

> why is this any different

Different than what?



@jreback jreback removed this from the 1.2 milestone Oct 31, 2020

# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
Contributor:

If you are going to remove the exception, then you can remove L1337-1338 entirely (as np.asarray is called on L1341).

Member Author:

> if you are going to remove the exception then you can remove L1337-1338 entirely (as np.asarray is called on L1341)

I removed the duplicate asarray, but we should still decide whether we want to keep that exception.

@jreback jreback added this to the 1.2 milestone Nov 15, 2020
jreback (Contributor) commented Nov 18, 2020:

@jorisvandenbossche if you can merge master and update for comments

jorisvandenbossche (Member Author):

> random python objects will be marked as 'mixed' in any event without the performance penalty below.

We could also simply return "mixed" as a kind of "unknown" instead of converting to a numpy array (or instead of the original exception).

jorisvandenbossche (Member Author):

So the inconsistency that I wanted to solve by removing the exception is this difference:

In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data

In [2]: arr = DecimalArray(make_data())

In [3]: pd.api.types.infer_dtype(arr)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-782ef3c80fa8> in <module>
----> 1 pd.api.types.infer_dtype(arr)

pandas/_libs/lib.pyx in pandas._libs.lib.infer_dtype()

pandas/_libs/lib.pyx in pandas._libs.lib._try_infer_map()

AttributeError: 'DecimalDtype' object has no attribute 'base'

In [4]: pd.api.types.infer_dtype(list(arr))
Out[4]: 'decimal'

(with the other change, the exact exception would change to "ValueError: cannot infer type for DecimalArray")

Now, of course, when passing a custom EA (unknown to pandas), infer_dtype can actually be costly with this PR because it will do a conversion to a numpy array (which can be expensive for an EA), while before it raised an error prior to converting to numpy.

So it might make sense to keep the exception.
But that in general makes the usage of infer_dtype also annoying, as anywhere we use it, it can potentially raise an error (which we currently don't really account for). So returning the fallback "mixed" could also be an option instead of converting to a numpy array, since "mixed" is already something we handle.

jbrockmendel (Member):

> We could also simply return "mixed" as a kind of "unknown" instead of converting to a numpy array (or instead of the original exception).

I like this idea better than converting to ndarray. Maybe something other than "mixed" though, so as to keep the meaning of "mixed" unambiguous?

jbrockmendel (Member):

My knee-jerk reaction to the DecimalArray case was "instead of casting to ndarray, we should just add 'decimal' to _TYPE_MAP". But that goes against the "we don't want to special-case our internal EAs any more than we have to" principle.

So two ideas:

  1. if we get an EA, can just return values.dtype.name or something like that
  2. part of the register_dtype process could add stuff to _TYPE_MAP (may just be a more complicated version of 1?)

jreback (Contributor) commented Nov 20, 2020:

Yeah, I think having a good path for EAs is ideal here.

jorisvandenbossche (Member Author):

> But that goes against the "we don't want to special-case our internal EAs any more than we have to" principle.

Note that DecimalArray isn't even an internal EA; it's a test case for an external EA (as long as we don't have a proper decimal dtype, at least ;))
And we are already special-casing our internal EAs here, since those are present in the _TYPE_MAP (which I think is fine to do here).

jreback (Contributor) left a comment:

Small comment: I am -0 on trying to directly parse the string here, as we already have machinery to parse dtypes. Is there a reason you are trying to do it this way?

jorisvandenbossche (Member Author):

So I did a rough search/inventory of the different internal use cases of infer_dtype. The main groups I see:

  • Many use cases are to infer a specific subset (eg "string", or "floating"/"integer"/"mixed-integer-float", or "boolean") from a list or object-dtype array
    -> since in those cases we know we don't start with an array with a specific dtype (except object dtype), this will never take the EA path
  • Infer "period"/"interval"/"datetime"/.. dtype from a non-EA (again not impacted by this discussion)
  • Infer "mixed-integer" for sorting (also not impacted by this discussion)
  • Infer dtype from an np.ndarray (idem)
  • Infer "integer" key type for indexing
    -> here we can potentially pass any EA (once we can use them for indexing), so for this use case it is actually important that infer_dtype(EA) doesn't raise an error (and the actual return value then doesn't matter for non-integer EAs)
  • ...

So from that, I think it will actually be good to change infer_dtype(EA) to never raise an error (as it does now on master for unknown array types).

The question is then which value? Infer by converting to object dtype numpy array, return an existing value "mixed", or return a new value like "unknown-array" ? Or let the EA dtype register something?

Given the potential expensive nature of coercing to object dtype, that might be something to avoid.
Given that "mixed" is already being used and has some use cases (eg the validation for the str accessor), it might be better to not re-use that.

So two ideas:

  1. if we get an EA, can just return values.dtype.name or something like that
  2. part of the register_dtype process could add stuff to _TYPE_MAP (may just be a more complicated version of 1?)

If we let the EA control this, I think it can only make sense if they return one of the existing categories? (what would we otherwise ever do with it, except ignore it?)
So that would rule out the first option, I think?

Long term, it might be useful to let the dtype register its "inferred_dtype", but then I think we should first have a better idea of some specific use cases where this would be useful (currently, many of the use cases I checked do some dtype inference when not yet having an array-like to start with).

So in the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but at least without raising an error.

jbrockmendel (Member):

> So on the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but then at least without raising an error.

+1

jreback (Contributor) commented Nov 24, 2020:

> So on the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but then at least without raising an error.

> +1

Yep, agreed, let's do this for now.

Is there a reason we cannot call registry.find() right now in this PR?

jorisvandenbossche (Member Author) commented Nov 24, 2020:

> Is there a reason we cannot call registry.find() right now in this PR?

What do you mean exactly? The registry is for converting a string into a dtype AFAIK, while here we want to know the general "type" given a dtype.

jreback (Contributor) commented Nov 24, 2020:

> Is there a reason we cannot call registry.find() right now in this PR?
>
> What do you mean exactly? The registry is for converting a string into a dtype AFAIK, while here we want to know the general "type" given a dtype.

Sure; once we have the actual dtype object, then we effectively have what's in _TYPE_MAP.

jorisvandenbossche (Member Author):

Ah, OK, but in the case of an ExtensionArray being passed (the case under discussion), we already have a dtype object.

jorisvandenbossche (Member Author):

Updated this to return "unknown-array" for ExtensionArrays we don't have in our _TYPE_MAP.
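With that change, passing an ExtensionArray whose dtype is not in _TYPE_MAP yields the new category instead of raising. A minimal check, reusing the DecimalArray test extension quoted earlier in the thread (behavior as of this PR):

```python
import pandas as pd
# DecimalArray ships with the pandas test suite; it plays the role of an
# external EA that pandas does not know about
from pandas.tests.extension.decimal import DecimalArray, make_data

arr = DecimalArray(make_data())
print(pd.api.types.infer_dtype(arr))  # unknown-array
```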

jorisvandenbossche (Member Author):

DatetimeTZ also includes the parametrization in the name:

In [95]: pd.DatetimeTZDtype(tz="UTC")
Out[95]: datetime64[ns, UTC]

In [96]: pd.DatetimeTZDtype(tz="UTC").name
Out[96]: 'datetime64[ns, UTC]'

(so it might be interval that is the outlier)

jbrockmendel (Member):

> Do we want to import [Interval, Decimal, Period] in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)
>
> If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all.

We already have Decimal in the namespace, and I think importing Interval and Period would be pretty benign (no circular dependencies). I'd be happy with this solution.

jreback (Contributor) commented Dec 16, 2020:

> In many cases, but not in all. That's the whole point of this PR (apart from adding "period" to the _TYPE_MAP dict): we already check the dtype's name, but that doesn't work for period, and doesn't work in general for parametrized dtypes.
>
> We seem to be a bit inconsistent on whether we include the parametrization in the name of the dtype or not. For example:

In [87]: pd.PeriodDtype("D")
Out[87]: period[D]

In [88]: str(pd.PeriodDtype("D"))
Out[88]: 'period[D]'

In [89]: pd.PeriodDtype("D").name
Out[89]: 'period[D]'

In [90]: pd.IntervalDtype(np.int64)
Out[90]: interval[int64]

In [91]: str(pd.IntervalDtype(np.int64))
Out[91]: 'interval[int64]'

In [92]: pd.IntervalDtype(np.int64).name
Out[92]: 'interval'

> So for Interval dtype, the "name" check already works; for Period it does not.
>
> Maybe we should change PeriodDtype.name, though. But I am not directly sure what the impact of that would be.

I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'.

jorisvandenbossche (Member Author):

> I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'

I am not sure we can "just" fix that. It's not just the Period dtype; basically any parametrized dtype (except interval) does this. I don't directly know what the impact of changing all their names would be (it's a public interface, so it would e.g. also be a breaking change).

jreback (Contributor) commented Dec 17, 2020:

> I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'
>
> I am not sure we can "just" fix that. It's not just the Period dtype; basically any parametrized dtype (except interval) does this. I don't directly know what the impact of changing all their names would be (it's a public interface, so it would e.g. also be a breaking change).

OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype).

jbrockmendel (Member):

> OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype)

Maybe generic_name? That avoids confusion with .base.name.

Longer term, we could require that parametrized dtypes have a shared generic .base dtype and return dtype.base.name here

jreback (Contributor) commented Jan 11, 2021:

> OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype)

> Maybe generic_name? That avoids confusion with .base.name.
>
> Longer term, we could require that parametrized dtypes have a shared generic .base dtype and return dtype.base.name here.

sgtm

jorisvandenbossche (Member Author) commented Jan 15, 2021:

I am a bit hesitant to start adding new attributes to dtypes just for this (that requires a broader discussion on the different naming attributes of the dtypes, IMO).

But I implemented Brock's idea of adding Period itself to the _TYPE_MAP, and thus removed the custom "string parsing" block that caused the discussion.

@jreback jreback added this to the 1.3 milestone Jan 16, 2021
jreback (Contributor) left a comment:

LGTM. Can you add a whatsnew note?

Also, the OP has an example for IntervalArray; can you add a test for that as well?

@@ -1079,6 +1080,7 @@ _TYPE_MAP = {
"timedelta64[ns]": "timedelta64",
"m": "timedelta64",
"interval": "interval",
Period: "period",
Contributor:

Probably need Interval here?

Member Author:

Interval is already handled by the "interval" string in the line above; see the non-inline comment with the link to the PR where this was fixed before.

jorisvandenbossche (Member Author):

> also the OP has an example for IntervalArray, can you add a test for that as well.

The interval case was already fixed before in #27653, which also already added specific infer_dtype tests for it.
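For reference, a minimal check of both cases (the interval behavior from #27653, the period behavior from this PR; requires a pandas version that includes both changes):

```python
import pandas as pd

# PeriodArray and IntervalArray via the .array accessor on the range helpers
periods = pd.period_range("2020-01", periods=3, freq="M").array
intervals = pd.interval_range(0, 3).array

print(pd.api.types.infer_dtype(periods))    # period
print(pd.api.types.infer_dtype(intervals))  # interval
```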

jorisvandenbossche (Member Author):

This is all green now, and I think all comments are addressed.

@jreback jreback merged commit 5a35050 into pandas-dev:master Feb 12, 2021
jreback (Contributor) commented Feb 12, 2021:

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the test-infer-dtype branch February 12, 2021 17:47
Successfully merging this pull request may close these issues.

BUG: lib.infer_type broken for IntervalArray / PeriodArray