BUG/TST: Fix infer_dtype for Period array-likes and general ExtensionArrays #37367
Conversation
pandas/_libs/lib.pyx (outdated diff)

```python
# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
```
do we have tests that get here?
The base extension test and the decimal test that I added in this PR get here (those were raising errors before)
Note I am not fully sure about this change. Before, it raised an error; now it will typically return "mixed" for random Python objects. Neither is very useful (both are an indication that nothing could be inferred), but I found it a bit inconsistent to raise in this case while, if you passed a list of the same objects, we would actually infer a type.
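For context, this is the existing behavior being referred to: `infer_dtype` on a plain list of unrecognized Python objects already returns "mixed" rather than raising. A small illustration using the public `pandas.api.types.infer_dtype` (assuming a pandas version where Period lists are inferred as "period"):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# A list of Period scalars is inferred as "period"
period_like = [pd.Period("2021-01", freq="M"), pd.Period("2021-02", freq="M")]
print(infer_dtype(period_like))

# Arbitrary Python objects are reported as "mixed" instead of raising
print(infer_dtype([object(), object()]))
```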
we are already doing this below. This does a full conversion of the object; I think this will simply kill performance in some cases, making this routine very unpredictable. I would rather not make this change here.
random Python objects will be marked as 'mixed' in any event without the performance penalty below.
pandas/_libs/lib.pyx (outdated diff)

```python
if val in _TYPE_MAP:
    return _TYPE_MAP[val]
# also check base name for parametrized dtypes (eg period[D])
if isinstance(val, str):
    val = val.split("[")[0]
```
can we explicitly add the Period[D] types to the type_map (there aren't that many)?
Going from the PeriodDtypeCode enum, there are actually quite a few?
if we were to get that specific, at some point it makes more sense to just return a dtype object
right here, why can't we try to convert to an EA type using registry.find()?
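For readers unfamiliar with the registry idea: pandas keeps a registry of extension dtypes that can parse dtype strings. The sketch below is a toy model of that mechanism, not pandas' actual internal `Registry` class; all names in it are illustrative:

```python
# Toy model of a dtype registry: each registered dtype class gets a chance
# to parse a string, and find() returns the first successful match.
class ToyPeriodDtype:
    def __init__(self, freq):
        self.freq = freq
        self.name = f"period[{freq}]"

    @classmethod
    def construct_from_string(cls, s):
        if s.startswith("period[") and s.endswith("]"):
            return cls(s[len("period["):-1])
        raise TypeError(f"cannot construct a ToyPeriodDtype from '{s}'")

class ToyRegistry:
    def __init__(self):
        self.dtypes = []

    def register(self, dtype_cls):
        self.dtypes.append(dtype_cls)

    def find(self, s):
        for dtype_cls in self.dtypes:
            try:
                return dtype_cls.construct_from_string(s)
            except TypeError:
                continue
        return None  # nothing recognized the string

registry = ToyRegistry()
registry.register(ToyPeriodDtype)
print(registry.find("period[D]").freq)  # "D"
```

As the reply below points out, the registry is for going from a string to a dtype; it does not help when you already hold the dtype object.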
Because we already have a dtype here, there is no need to look anything up in the registry. We don't want to infer a type; we have an EA dtype that we want to find in the TYPE_MAP to infer the category returned by infer_dtype, by checking the name, kind or base attributes of the dtype.
Do we want to import those in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)

If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable to all.
do we not already import all of the scalar EA types? why is this any different?

+1 on using the existing machinery
> I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

100% agree on both points

> Do we want to import those in lib.pyx

not ideal, but I don't think it will harm anything ATM
> (and please don't say that I should convert the string to a dtype; as you can see in the code a few lines above, we actually start from a dtype object)

exactly, you are missing the entire point of the dtype abstraction: you avoid parsing strings in the first place.

I will be blocking this until/unless a good solution is done.
> do we not already import all of the scalar EA types?

lib.pyx doesn't know anything about EAs. It only imports helper functions like is_period_object.

> why is this any different

different than what?
pandas/_libs/lib.pyx (outdated diff)

```python
# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
values = np.asarray(value)
```
if you are going to remove the exception, then you can remove L1337-1338 entirely (as np.asarray is called on L1341)

> if you are going to remove the exception then you can remove this L1337,1338 entirely (as np.asarray is called on L1341)

I removed the duplicate asarray, but we should still decide whether we want to keep that exception or not.
@jorisvandenbossche if you can merge master and update for comments

We could also simply return "mixed" as a kind of "unknown" instead of converting to a numpy array (or instead of the original exception).

So the inconsistency that I thought to solve by removing the exception is this difference: ...

(with the other change, the exact exception would change to "ValueError: cannot infer type for DecimalArray")

Now, of course, when passing a custom EA (unknown to pandas), ... So it might make sense to keep the exception.

I like this idea better than converting to ndarray. Maybe something other than "mixed", though, so as to keep the meaning of "mixed" unambiguous?

My knee-jerk reaction to the DecimalArray case was "instead of casting to ndarray, we should just add 'decimal' to ..."

So two ideas: ...

yeah, I think having a good path for EAs is ideal here.

Note that DecimalArray isn't even an internal EA; it's a test case for an external EA (as long as we don't have a proper decimal dtype, at least ;))
small comment: I am -0 on trying to directly parse the string here, as we already have machinery to parse dtypes. Is there a reason you are trying to do it this way?
So I did a rough search / inventory of the different use cases internally of ...

So from that, I think it will actually be good to change that.

The question is then which value? Infer by converting to an object-dtype numpy array, return an existing value like "mixed", or return a new value like "unknown-array"? Or let the EA dtype register something? Given the potentially expensive nature of coercing to object dtype, that might be something to avoid.

If we let the EA control this, I think it only makes sense if they return one of the existing categories (what would we otherwise ever do with it, except ignore it?).

Long term, it might be useful to let the dtype register its "inferred_dtype", but then I think we should first have a better idea of some specific use cases for which this would be used/useful (currently, many use cases I checked do some dtype inference when not yet having an array-like to start with).

So in the short term, maybe we can use the "unknown-array" return value? That would also not be used in practice, so it would mean it is basically ignored, but then at least without raising an error.
+1

yep, agreed, let's do this for now. Is there a reason we cannot call ...

What do you mean exactly?

sure, once we have the actual dtype object, then we effectively have what's in _TYPE_MAP

Ah, OK, but in the case of an ExtensionArray being passed (the case under discussion), we already have a type object

Updated this to return "unknown-array" for ExtensionArrays we don't have in our TYPE_MAP
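The agreed-upon fallback can be sketched in plain Python (toy code, not the actual Cython; `infer_category`, `FakeDecimalArray`, and the map contents are illustrative stand-ins): a recognized dtype maps to a category, and any other array-like carrying a dtype yields "unknown-array" instead of raising.

```python
# Sketch of the fallback behavior: known dtypes map to a category,
# unrecognized ExtensionArray dtypes yield "unknown-array" instead of
# raising. Illustrative only.
_TOY_TYPE_MAP = {"period": "period", "interval": "interval"}

def infer_category(arr):
    dtype = getattr(arr, "dtype", None)
    if dtype is None:
        # not even array-like: nothing we can do
        raise ValueError(f"cannot infer type for {type(arr)}")
    base = str(dtype).split("[")[0]
    return _TOY_TYPE_MAP.get(base, "unknown-array")

class FakeDecimalArray:
    """Stand-in for a third-party ExtensionArray with an unrecognized dtype."""
    dtype = "decimal"

class FakePeriodArray:
    """Stand-in for a period-dtype array-like."""
    dtype = "period[D]"

print(infer_category(FakeDecimalArray()))  # "unknown-array"
print(infer_category(FakePeriodArray()))   # "period"
```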
DatetimeTZ also includes the parametrization in the name: ...

(so it might be interval that is the outlier)

We already have Decimal in the namespace, and I think importing Interval and Period would be pretty benign (no circular dependencies). I'd be happy with this solution.

I see; yeah, maybe let's just fix name to be what I am suggesting as 'base_name'

I am not sure we can "just" fix that. It's not just the Period dtype; basically any parametrized dtype (except for interval) does this. I don't directly know what the impact would be of changing all their ...

ok, then let's add ...

maybe ... Longer term, we could require that parametrized dtypes have a shared generic ...

sgtm

I am a bit hesitant to start adding new attributes to dtypes just for this (that requires a broader discussion on the different naming attributes of the dtypes, IMO). But I implemented Brock's idea of adding Period itself to the type_map, and thus removed the custom "string parsing" block that caused the discussion.
lgtm. Can you add a whatsnew note?

Also, the OP has an example for IntervalArray; can you add a test for that as well?
```diff
@@ -1079,6 +1080,7 @@ _TYPE_MAP = {
     "timedelta64[ns]": "timedelta64",
     "m": "timedelta64",
     "interval": "interval",
+    Period: "period",
```
prob need Interval here?
Interval is already handled by the "interval" string in the line above; see the non-inline comment with the link to the PR where this was fixed before.

The interval case was already fixed in #27653, which also already added specific ...

This is all green now, and I think all comments are addressed.

thanks @jorisvandenbossche

Closes #23553

In addition, I also changed infer_dtype to fall back to inferring from the scalar elements if the array-like is not recognized directly (currently it raises an error, which doesn't seem very useful).