-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INT: provide helpers for accessing the values of DataFrame columns #33252
INT: provide helpers for accessing the values of DataFrame columns #33252
Conversation
Do you have ideas generally about places this is applicable? One potential spot could be in groupby code: pandas/pandas/core/groupby/generic.py Line 980 in 66e4517
Which right now will yield a 2D object when there are duplicate labels but maybe shouldn't Wondering if we should include the label as part of what is returned here |
In principle you can still do that rather easily manually, by doing The |
I've got a branch motivated by trying to optimize this, ended up making a @jorisvandenbossche do you have a good read on what part of the Series construction is causing the overhead? |
is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype) | ||
for values in self._iter_arrays() | ||
], | ||
dtype=bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the perf issue here is in the self.dtypes
call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as much as im trying to avoid ._data
, might be cleaner to use self._data.get_dtypes
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the perf issue here is in the self.dtypes call?
Both the self.dtypes
that creates a Series, as the apply
that creates another Series with inference.
as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here
Indeed, that would be even more appropriate here.
Now, I am happy to change it to that, but that's not really the core of this PR. I mainly wanted to add one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here
When having a dataframe with only extension blocks, it's actually slower than [v.dtype for v in df._iter_arrays()]
(for a dataframe with a 2D block it will of course be faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we're just going to call .any()
on this anyway, could do the any up-front on the iterator, no big deal
I didn't check this time any more (looked at it previously), but here, when the only thing you need is an array, any overhead that the Series constructor gives is easily avoided overhead. It's not that this PR is adding complex code or so. I think it will be a useful helper function regularly (I added one use case here, and the two linked PRs can be two other use cases already). I suppose that the Series constructor overhead can be decreased (which would certainly be useful anyway), but it might require some additional keywords (eg a fastpath that avoids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok, are there some tests for this specifically?
I don't think tests are necessarily needed, since it's an internal helper function (most private methods don't have specific tests), but I can add a simple one if needed. |
pandas/core/frame.py
Outdated
""" | ||
return self._data.iget_values(i) | ||
|
||
def _iter_arrays(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this by typed as a Generator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterator
, apparently (https://mypy.readthedocs.io/en/stable/kinds_of_types.html#generators), added that
pandas/core/frame.py
Outdated
@@ -2568,6 +2568,21 @@ def _ixs(self, i: int, axis: int = 0): | |||
|
|||
return result | |||
|
|||
def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's name this _iget_values for consistency no? (ixs sounds like cross-section, e.g. rows which this is not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no _get_values
on DataFrame, so for DataFrame that gives no consistency, but of course more consistent with the block method. Anyway, I don't really have a preference, either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Union[np.ndarray, ExtensionArray]" should be "ArrayLike"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel do you have a preference for the name?
ok no big deal |
pandas/core/frame.py
Outdated
@@ -2568,6 +2569,21 @@ def _ixs(self, i: int, axis: int = 0): | |||
|
|||
return result | |||
|
|||
def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]: | |||
""" | |||
Get the values of the ith column (ndarray or ExtensionArray, as stored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ith" -> "i'th"
pandas/core/frame.py
Outdated
""" | ||
return self._data.iget_values(i) | ||
|
||
def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"
def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]: | ||
""" | ||
Iterate over the arrays of all columns in order. | ||
This returns the values as stored in the Block (ndarray or ExtensionArray). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think another newline?
pandas/core/internals/managers.py
Outdated
@@ -1001,6 +1002,14 @@ def iget(self, i: int) -> "SingleBlockManager": | |||
self.axes[1], | |||
) | |||
|
|||
def iget_values(self, i: int) -> Union[np.ndarray, ExtensionArray]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayLike
some comments on annotations, otherwise LGTM |
Honestly, I am not yet much into the typing in pandas, but I find that a rather confusing name (as in human terms, array-like means something less strict to me) |
I understand where you're coming from, but this is one of those "if you want to change the policy, thats its own discussion" things
I'm fine with |
+1 changed to that And update the type annotations. |
So I just noticed by trying to use this in #32779 that @jbrockmendel do you know if there is existing API to get one column of a 2D Timedelta/DatetimeBlock as EA ? |
There is not. We already override
|
I changed |
It tentatively looks like this is coming from an issue with Update #33290 doesnt quite do it, but a related |
looks like needs a rebase |
thanks @jorisvandenbossche |
Broken off from #32867, also mentioned this in #33229
In general, when doing certain things column-wise, we often just need the arrays (eg to calculate a reduction, to check the dtype, ..), and creating Series gives unnecessary overhead in that case.
In this PR, I added therefore two helper functions for making it easier to do this (
_ixs_values
as variant on_ixs
but returning the array instead of Series, and_iter_arrays
that calls_ixs_values
for each column iteratively as an additional helper function).I also used it in one place as illustration (in
_reduce
, what I was working on in #32867, but it can of course be used more generally).In that example case, it is to replace an apply:
(and this is only fo 10 columns)
@jbrockmendel