
BUG/TST: Fix infer_dtype for Period array-likes and general ExtensionArrays #37367

Merged: 15 commits, Feb 12, 2021

Conversation

jorisvandenbossche (Member):

Closes #23553

In addition, I also changed infer_dtype to fall back to inferring from the scalar elements if the array-like is not recognized directly (currently it raises an error, which doesn't seem very useful).

@jorisvandenbossche jorisvandenbossche added the Dtype Conversions Unexpected or buggy dtype conversions label Oct 23, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2 milestone Oct 23, 2020

# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
Member:

do we have tests that get here?

Member Author:

The base extension test and the decimal test that I added in this PR get here (those were raising errors before)

Member Author:

Note I am not fully sure about this change. Before, it raised an error; now it will typically return "mixed" for random python objects. Neither seems very useful (or in other words, both are an indication that nothing could be inferred). But I found it a bit inconsistent to raise in this case, whereas if you passed a list of the same objects, we would actually infer a type.

Contributor:

We are already doing this below. This does a full conversion of the object; I think this will simply kill performance in some cases, making this routine very unpredictable. I would rather not make this change here.

Contributor:

Random python objects will be marked as 'mixed' in any event, without the performance penalty below.

if val in _TYPE_MAP:
return _TYPE_MAP[val]
# also check base name for parametrized dtypes (eg period[D])
if isinstance(val, str):
val = val.split("[")[0]
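The check above can be illustrated with a minimal plain-Python sketch (this mirrors the excerpt rather than the actual lib.pyx code, and the tiny _TYPE_MAP subset here is only illustrative):

```python
# Illustrative subset of the mapping; the real _TYPE_MAP is much larger
_TYPE_MAP = {"period": "period", "interval": "interval"}

def lookup(val):
    # direct hit first
    if val in _TYPE_MAP:
        return _TYPE_MAP[val]
    # also check the base name for parametrized dtype strings (e.g. "period[D]")
    if isinstance(val, str):
        val = val.split("[")[0]
        if val in _TYPE_MAP:
            return _TYPE_MAP[val]
    return None

print(lookup("period[D]"))   # period
print(lookup("float64"))     # None (not in this illustrative subset)
```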
Contributor:

Can we explicitly add the Period[D] types to the _TYPE_MAP? (there aren't that many)

Member Author:

Going from the PeriodDtypeCode enum, there are actually quite a few.

Member:

If we were to get that specific, at some point it makes more sense to just return a dtype object.

Contributor:

Right here, why can't we try to convert to an EA type using registry.find()?

jorisvandenbossche (Member Author) commented Nov 25, 2020:

Because we already have a dtype here, there is no need to look anything up in the registry. We don't want to infer a type; we have an EA dtype that we want to find in the _TYPE_MAP, to infer the category returned by infer_dtype by checking the name, kind, or base attributes of the dtype.
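That attribute-based lookup can be sketched roughly as follows (a hypothetical simplification in plain Python, not the actual Cython code; the small _TYPE_MAP subset is illustrative):

```python
import numpy as np

# Illustrative subset; the real _TYPE_MAP maps many more names and kind codes
_TYPE_MAP = {"f": "floating", "i": "integer", "interval": "interval"}

def try_infer_map(dtype):
    # check a few attributes of the dtype object against the map
    for attr in ("name", "kind", "base"):
        val = getattr(dtype, attr, None)
        try:
            if val in _TYPE_MAP:
                return _TYPE_MAP[val]
        except TypeError:
            # some attribute values may not be hashable
            pass
    return None

print(try_infer_map(np.dtype("float64")))  # floating (matched on kind "f")
```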

Member Author:

Do we want to import those in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)

If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

Contributor:

Do we not already import all of the scalar EA types? Why is this any different?

+1 on using the existing machinery

Member:

> I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

100% agree on both points.

> Do we want to import those in lib.pyx?

Not ideal, but I don't think it will harm anything ATM.

Contributor:

> and please don't say that I should convert the string to a dtype, as you can see in the code a few lines above, we actually start from a dtype object

Exactly: you are missing the entire point of the dtype abstraction. You avoid parsing strings in the first place.

I will be blocking this until/unless a good solution is done.

Member Author:

> do we not already import all of the scalar EA types?

lib.pyx doesn't know anything about EAs. It only imports helper functions like is_period_object.

> why is this any different

Different than what?



@jreback jreback removed this from the 1.2 milestone Oct 31, 2020

# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
Contributor:

If you are going to remove the exception, then you can remove L1337-1338 entirely (as np.asarray is called on L1341).

Member Author:

> if you are going to remove the exception then you can remove L1337-1338 entirely (as np.asarray is called on L1341)

I removed the duplicate asarray, but we should still decide whether we want to keep that exception.

@jreback jreback added this to the 1.2 milestone Nov 15, 2020
jreback (Contributor) commented Nov 18, 2020:

@jorisvandenbossche if you can merge master and update for comments

jorisvandenbossche (Member Author):

> random python objects will be marked as 'mixed' in any event without the performance penalty below.

We could also simply return "mixed" as a kind of "unknown" instead of converting to a numpy array (or instead of the original exception).

jorisvandenbossche (Member Author):

So the inconsistency that I wanted to solve by removing the exception is this difference:

In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data

In [2]: arr = DecimalArray(make_data())

In [3]: pd.api.types.infer_dtype(arr)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-782ef3c80fa8> in <module>
----> 1 pd.api.types.infer_dtype(arr)

pandas/_libs/lib.pyx in pandas._libs.lib.infer_dtype()

pandas/_libs/lib.pyx in pandas._libs.lib._try_infer_map()

AttributeError: 'DecimalDtype' object has no attribute 'base'

In [4]: pd.api.types.infer_dtype(list(arr))
Out[4]: 'decimal'

(with the other change, the exact exception would change to "ValueError: cannot infer type for DecimalArray")

Now, of course, when passing a custom EA (unknown to pandas), infer_dtype can actually be costly with this PR because it will do a conversion to a numpy array (which can be expensive for an EA), while before it raised an error prior to converting to numpy.

So it might make sense to keep the exception.
But that in general makes the usage of infer_dtype also annoying, as anywhere we use it, it can potentially raise an error (which we currently don't really account for). So returning the fallback "mixed" could also be an option instead of converting to a numpy array, since "mixed" is already something we handle.

jbrockmendel (Member):

> We could also simply return "mixed" as a kind of "unknown" instead of converting to a numpy array (or instead of the original exception).

I like this idea better than converting to ndarray. Maybe something other than "mixed" though, so as to keep the meaning of "mixed" unambiguous?

jbrockmendel (Member):

My knee-jerk reaction to the DecimalArray case was "instead of casting to ndarray, we should just add 'decimal' to _TYPE_MAP". But that goes against the "we don't want to special-case our internal EAs any more than we have to" principle.

So two ideas:

  1. if we get an EA, can just return values.dtype.name or something like that
  2. part of the register_dtype process could add stuff to _TYPE_MAP (may just be a more complicated version of 1?)

jreback (Contributor) commented Nov 20, 2020:

Yeah, I think having a good path for EAs is ideal here.

jorisvandenbossche (Member Author):

> But that goes against the "we don't want to special-case our internal EAs any more than we have to" principle.

Note that DecimalArray isn't even an internal EA; it's a test case for an external EA (as long as we don't have a proper decimal dtype, at least ;))
And we are already special-casing our internal EAs here, since those are present in the _TYPE_MAP (which I think is fine to do here).

jreback (Contributor) left a comment:

Small comment: I am -0 on trying to directly parse the string here, as we already have machinery to parse dtypes. Is there a reason you are trying to do it this way?

jorisvandenbossche (Member Author):

So I did a rough search/inventory of the different internal use cases of infer_dtype. The main groups I see:

  • Many use cases are to infer a specific subset (eg "string", or "floating"/"integer"/"mixed-integer-float", or "boolean") from a list or object-dtype array
    -> since in those cases we know we don't start with an array with a specific dtype (except object dtype), this will never take the EA path
  • Infer "period"/"interval"/"datetime"/.. dtype from a non-EA (again not impacted by this discussion)
  • Infer "mixed-integer" for sorting (also not impacted by this discussion)
  • Infer dtype from an np.ndarray (idem)
  • Infer "integer" key type for indexing
    -> here we can potentially pass any EA (once we can use them for indexing), so for this use case it is actually important that infer_dtype(EA) doesn't raise an error (and the actual return value then doesn't matter for non-integer EAs)
  • ...

So from that, I think it will actually be good to change infer_dtype(EA) to never raise an error (as it does now on master for unknown array types).

The question is then which value? Infer by converting to object dtype numpy array, return an existing value "mixed", or return a new value like "unknown-array" ? Or let the EA dtype register something?

Given the potential expensive nature of coercing to object dtype, that might be something to avoid.
Given that "mixed" is already being used and has some use cases (eg the validation for the str accessor), it might be better to not re-use that.

So two ideas:

  1. if we get an EA, can just return values.dtype.name or something like that
  2. part of the register_dtype process could add stuff to _TYPE_MAP (may just be a more complicated version of 1?)

If we let the EA control this, I think it can only make sense if they return one of the existing categories? (what would we otherwise ever do with it, except ignore it?)
So that would rule out the first option, I think?

Long term, it might be useful to let the dtype register its "inferred_dtype", but then I think we should first have a better idea of some specific use cases where this would be useful (currently, many of the use cases I checked do some dtype inference when not yet having an array-like to start with).

So in the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but at least without raising an error.

jbrockmendel (Member):

> So on the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but then at least without raising an error.

+1

jreback (Contributor) commented Nov 24, 2020:

> So on the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but then at least without raising an error.

> +1

Yep, agreed, let's do this for now.

Is there a reason we cannot call registry.find() right now in this PR?

jorisvandenbossche (Member Author) commented Nov 24, 2020:

> Is there a reason we cannot call registry.find() right now in this PR?

What do you mean exactly? The registry is for converting a string into a dtype AFAIK, while here we want to know the general "type" given a dtype.

jreback (Contributor) commented Nov 24, 2020:

> Is there a reason we cannot call registry.find() right now in this PR?
>
> What do you mean exactly? The registry is for converting a string into a dtype AFAIK, while here we want to know the general "type" given a dtype.

Sure; once we have the actual dtype object, then we effectively have what's in _TYPE_MAP.

jorisvandenbossche (Member Author):

Ah, OK, but in the case of an ExtensionArray being passed (the case under discussion), we already have a dtype object.

jorisvandenbossche (Member Author):

Updated this to return "unknown-array" for ExtensionArrays we don't have in our _TYPE_MAP.
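With that change, passing an ExtensionArray whose dtype is not in _TYPE_MAP yields the new category instead of raising. A minimal check, reusing the DecimalArray test extension quoted earlier in the thread (behavior as of this PR):

```python
import pandas as pd
# DecimalArray ships with the pandas test suite; it plays the role of an
# external EA that pandas does not know about
from pandas.tests.extension.decimal import DecimalArray, make_data

arr = DecimalArray(make_data())
print(pd.api.types.infer_dtype(arr))  # unknown-array
```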

jorisvandenbossche (Member Author):

DatetimeTZ also includes the parametrization in the name:

In [95]: pd.DatetimeTZDtype(tz="UTC")
Out[95]: datetime64[ns, UTC]

In [96]: pd.DatetimeTZDtype(tz="UTC").name
Out[96]: 'datetime64[ns, UTC]'

(so it might be interval that is the outlier)

jbrockmendel (Member):

> Do we want to import [Interval, Decimal, Period] in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)
>
> If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all.

We already have Decimal in the namespace, and I think importing Interval and Period would be pretty benign (no circular dependencies). I'd be happy with this solution.

jreback (Contributor) commented Dec 16, 2020:

> In many cases, but not in all. That's the whole point of this PR (apart from adding "period" to the _TYPE_MAP dict): we already check the dtype's name, but that doesn't work for period, and doesn't work in general for parametrized dtypes.
>
> We seem to be a bit inconsistent on whether we include the parametrization in the name of the dtype or not. For example:

In [87]: pd.PeriodDtype("D")
Out[87]: period[D]

In [88]: str(pd.PeriodDtype("D"))
Out[88]: 'period[D]'

In [89]: pd.PeriodDtype("D").name
Out[89]: 'period[D]'

In [90]: pd.IntervalDtype(np.int64)
Out[90]: interval[int64]

In [91]: str(pd.IntervalDtype(np.int64))
Out[91]: 'interval[int64]'

In [92]: pd.IntervalDtype(np.int64).name
Out[92]: 'interval'

> So for Interval dtype, the "name" check already works; for Period it does not.
>
> Maybe we should change PeriodDtype.name, though. But I am not directly sure what the impact of that would be.

I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'.

jorisvandenbossche (Member Author):

> I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'

I am not sure we can "just" fix that. It's not just the Period dtype; basically any parametrized dtype (except interval) does this. I don't directly know what the impact of changing all their names would be (it's a public interface, so it would e.g. also be a breaking change).

jreback (Contributor) commented Dec 17, 2020:

> I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'
>
> I am not sure we can "just" fix that. It's not just the Period dtype; basically any parametrized dtype (except interval) does this. I don't directly know what the impact of changing all their names would be (it's a public interface, so it would e.g. also be a breaking change).

OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype).

jbrockmendel (Member):

> OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype)

Maybe generic_name? That avoids confusion with .base.name.

Longer term, we could require that parametrized dtypes have a shared generic .base dtype and return dtype.base.name here

jreback (Contributor) commented Jan 11, 2021:

> OK, then let's add .base_name for now and open an issue to see if we can normalize .name to be equivalent to str(dtype)

> Maybe generic_name? That avoids confusion with .base.name.
>
> Longer term, we could require that parametrized dtypes have a shared generic .base dtype and return dtype.base.name here.

sgtm

jorisvandenbossche (Member Author) commented Jan 15, 2021:

I am a bit hesitant to start adding new attributes to dtypes just for this (that requires a broader discussion on the different naming attributes of the dtypes, IMO).

But I implemented Brock's idea of adding Period itself to the _TYPE_MAP, and thus removed the custom "string parsing" block that caused the discussion.

@jreback jreback added this to the 1.3 milestone Jan 16, 2021
jreback (Contributor) left a comment:

LGTM. Can you add a whatsnew note?

Also, the OP has an example for IntervalArray; can you add a test for that as well?

@@ -1079,6 +1080,7 @@ _TYPE_MAP = {
"timedelta64[ns]": "timedelta64",
"m": "timedelta64",
"interval": "interval",
Period: "period",
Contributor:

Probably need Interval here?

Member Author:

Interval is already handled by the "interval" string in the line above; see the non-inline comment with the link to the PR where this was fixed before.

jorisvandenbossche (Member Author):

> also the OP has an example for IntervalArray, can you add a test for that as well.

The interval case was already fixed before in #27653, which also already added specific infer_dtype tests for it.
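For reference, a minimal check of both cases (the interval behavior from #27653, the period behavior from this PR; requires a pandas version that includes both changes):

```python
import pandas as pd

# PeriodArray and IntervalArray via the .array accessor on the range helpers
periods = pd.period_range("2020-01", periods=3, freq="M").array
intervals = pd.interval_range(0, 3).array

print(pd.api.types.infer_dtype(periods))    # period
print(pd.api.types.infer_dtype(intervals))  # interval
```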

jorisvandenbossche (Member Author):

This is all green now, and I think all comments are addressed.

@jreback jreback merged commit 5a35050 into pandas-dev:master Feb 12, 2021
jreback (Contributor) commented Feb 12, 2021:

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the test-infer-dtype branch February 12, 2021 17:47
Successfully merging this pull request may close these issues.

BUG: lib.infer_type broken for IntervalArray / PeriodArray