
INT: provide helpers for accessing the values of DataFrame columns #33252

Merged (7 commits) on Apr 10, 2020

Conversation

Member

@jorisvandenbossche jorisvandenbossche commented Apr 3, 2020

Broken off from #32867, also mentioned this in #33229

In general, when doing certain things column-wise, we often just need the arrays (e.g. to calculate a reduction, or to check the dtype), and creating a Series gives unnecessary overhead in that case.

In this PR, I therefore added two helper functions to make this easier: _ixs_values, a variant of _ixs that returns the array instead of a Series, and _iter_arrays, which calls _ixs_values for each column iteratively.

I also used it in one place as an illustration (in _reduce, which I was working on in #32867), but it can of course be used more generally.
In that example case, it replaces an apply:

In [1]: df = pd.DataFrame(np.random.randint(1000, size=(10000,10))).astype("Int64").copy() 

In [2]: from pandas.core.dtypes.common import is_datetime64_any_dtype, is_period_dtype

In [3]: %timeit df.dtypes.apply(lambda x: is_datetime64_any_dtype(x) or is_period_dtype(x)) 
256 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit np.array([is_datetime64_any_dtype(values) or is_period_dtype(values) for values in df._iter_arrays()], dtype=bool) 
56.2 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(and this is only for 10 columns)
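The same trade-off can be sketched with public API only (`pandas.api.types.is_datetime64_any_dtype` is the public home of the dtype check; the private helper in this PR additionally removes the remaining Series construction). A minimal, hedged comparison sketch:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

df = pd.DataFrame(np.random.randint(1000, size=(100, 3))).astype("Int64")

# Series-based: df.dtypes materializes a Series, and apply builds
# a second Series with inference on top of it
via_apply = df.dtypes.apply(is_datetime64_any_dtype).to_numpy()

# comprehension-based: iterate the dtype objects directly and build
# the boolean ndarray in one step, with no intermediate Series
via_loop = np.array(
    [is_datetime64_any_dtype(dt) for dt in df.dtypes.to_numpy()], dtype=bool
)

assert (via_apply == via_loop).all()
assert not via_loop.any()  # Int64 columns are not datetime-like
```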

@jbrockmendel

Member

WillAyd commented Apr 3, 2020

Do you have ideas generally about places this is applicable? One potential spot could be in groupby code:

def _iterate_slices(self) -> Iterable[Series]:

Which right now will yield a 2D object when there are duplicate labels but maybe shouldn't

Wondering if we should include the label as part of what is returned here
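The 2D-slice behaviour mentioned above can be seen with plain column selection on duplicated labels (a minimal sketch of the symptom, not the groupby code itself):

```python
import pandas as pd

# two columns sharing the same label
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# with duplicated labels, selecting by label returns a 2D DataFrame
# rather than a 1D Series, which is what slice-iterating code
# has to guard against
sliced = df["a"]
assert isinstance(sliced, pd.DataFrame)
assert sliced.shape == (2, 2)
```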

Member Author

jorisvandenbossche commented Apr 3, 2020

Wondering if we should include the label as part of what is returned here

In principle you can still do that rather easily manually, with for label, values in zip(df.columns, df._iter_arrays()), if you also need the corresponding labels.
But yeah, it will depend a bit on where this would be used whether it makes sense to include this in the helper function or not.

The _iterate_slices you link to iterates through the columns as Series objects, while the function I am adding here specifically gives arrays. Now, I am not familiar enough with that specific groupby code to know if it actually needs the values as a Series.
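A rough public-API approximation of the zip pattern described above (note that `df[name].array` still constructs a Series first, which is exactly the overhead the private helper is meant to avoid):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# pair each column label with that column's backing array; the inner
# generator mimics _iter_arrays() using only public API
pairs = {
    name: arr
    for name, arr in zip(df.columns, (df[c].array for c in df.columns))
}

assert set(pairs) == {"a", "b"}
assert all(len(arr) == len(df) for arr in pairs.values())
```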

@jbrockmendel
Member

_iterate_slices

I've got a branch motivated by trying to optimize this; I ended up making an NDFrame._from_mgr constructor for when we are passing just a BlockManager/SingleBlockManager to DataFrame/Series. Got a good deal of mileage out of that.

@jorisvandenbossche do you have a good read on what part of the Series construction is causing the overhead?

is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype)
for values in self._iter_arrays()
],
dtype=bool,
Member

the perf issue here is in the self.dtypes call?

Member

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Member Author

the perf issue here is in the self.dtypes call?

Both the self.dtypes call that creates a Series, and the apply that creates another Series with inference.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Indeed, that would be even more appropriate here.

Now, I am happy to change it to that, but that's not really the core of this PR. I added one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well.

Member Author

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

For a DataFrame with only extension blocks, it's actually slower than [v.dtype for v in df._iter_arrays()] (for a DataFrame with a 2D block it will of course be faster)

Member

since we're just going to call .any() on this anyway, could do the any up-front on the iterator, no big deal

@jorisvandenbossche
Member Author

do you have a good read on what part of the Series construction is causing the overhead?

I didn't check again this time (I looked at it previously), but here, when the only thing you need is the array, any overhead from the Series constructor is easily avoided. It's not that this PR adds complex code. I think it will be a regularly useful helper function (I added one use case here, and the two linked PRs can be two other use cases already).

I suppose that the Series constructor overhead can be decreased (which would certainly be useful anyway), but it might require some additional keywords (e.g. a fastpath that avoids sanitize_array) or a separate private constructor like _from_array (similar to DataFrame._from_arrays).
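The constructor overhead being discussed is easy to observe in isolation; a minimal measurement sketch (absolute numbers will vary by machine and pandas version):

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(1000)

# summing through a freshly constructed Series pays for index creation,
# block wrapping, and dtype checks on every call; the raw ndarray does not
t_series = timeit.timeit(lambda: pd.Series(arr).sum(), number=200)
t_array = timeit.timeit(lambda: arr.sum(), number=200)

# both paths compute the same value
assert pd.Series(arr).sum() == arr.sum() == 499500
```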

Contributor

@jreback jreback left a comment

looks ok, are there some tests for this specifically?

@jreback added the Indexing (related to indexing on series/frames, not to indexes themselves) and Performance (memory or execution speed performance) labels on Apr 3, 2020
@jorisvandenbossche
Member Author

I don't think tests are necessarily needed, since it's an internal helper function (most private methods don't have specific tests), but I can add a simple one if needed.

"""
return self._data.iget_values(i)

def _iter_arrays(self):
Contributor

shouldn't this be typed as a Generator?
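For context on the annotation question: typing.Generator is only needed when the send and return types matter; a generator function that only yields is conventionally annotated with the simpler Iterator (a general Python typing note, not pandas-specific):

```python
from typing import Generator, Iterator

def gen_simple(n: int) -> Iterator[int]:
    # only yields, so Iterator[int] is the idiomatic annotation
    for i in range(n):
        yield i

def gen_full(n: int) -> Generator[int, None, None]:
    # Generator[Yield, Send, Return] spells out the unused slots explicitly
    for i in range(n):
        yield i

assert list(gen_simple(3)) == list(gen_full(3)) == [0, 1, 2]
```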


@@ -2568,6 +2568,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Contributor

let's name this _iget_values for consistency no? (ixs sounds like cross-section, e.g. rows which this is not)

Member Author

There is no _get_values on DataFrame, so for DataFrame that gives no consistency, though it is of course more consistent with the block method. Anyway, I don't really have a preference either way.

Member

"Union[np.ndarray, ExtensionArray]" should be "ArrayLike"

Member Author

@jbrockmendel do you have a preference for the name?

Contributor

jreback commented Apr 3, 2020

I don't think tests are necessarily needed, since it's an internal helper function (most private methods don't have specific tests), but I can add a simple one if needed.

ok no big deal

@@ -2568,6 +2569,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
"""
Get the values of the ith column (ndarray or ExtensionArray, as stored
Member

"ith" -> "i'th"

"""
return self._data.iget_values(i)

def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]:
Member

"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"

def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]:
"""
Iterate over the arrays of all columns in order.
This returns the values as stored in the Block (ndarray or ExtensionArray).
Member

i think another newline?

@@ -1001,6 +1002,14 @@ def iget(self, i: int) -> "SingleBlockManager":
self.axes[1],
)

def iget_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Member

ArrayLike

@jbrockmendel
Member

some comments on annotations, otherwise LGTM

@jorisvandenbossche
Member Author

"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"

Honestly, I am not yet much into the typing in pandas, but I find that a rather confusing name (in human terms, "array-like" means something less strict to me)
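For context, the point of the alias is simply to name the union once (a hypothetical sketch; ExtensionArray here is a stand-in class, and the real pandas._typing definition may differ in detail):

```python
from typing import Iterator, List, Union
import numpy as np

class ExtensionArray:  # stand-in for pandas' ExtensionArray base class
    pass

# one shared alias instead of spelling out the union in each signature
ArrayLike = Union[np.ndarray, ExtensionArray]

def iter_arrays(arrays: List[ArrayLike]) -> Iterator[ArrayLike]:
    # every internals helper can reuse the alias in its annotations
    yield from arrays

result = list(iter_arrays([np.arange(3), ExtensionArray()]))
assert isinstance(result[0], np.ndarray)
assert isinstance(result[1], ExtensionArray)
```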

@jbrockmendel
Member

Honestly, I am not yet much into the typing in pandas, but I find that a rather confusing name (as in human terms, array-like means something less strict to me)

I understand where you're coming from, but this is one of those "if you want to change the policy, that's its own discussion" things

do you have a preference for the name?

I'm fine with _ixs_values. _iter_arrays might be clearer as _iter_column_arrays, in which case it might make sense for _ixs_values to become _get_column_array

@jorisvandenbossche
Member Author

_iter_arrays might be clearer as _iter_column_arrays, in which case it might make sense for _ixs_values to become _get_column_array

+1, changed to that.

And updated the type annotations.

@jorisvandenbossche
Member Author

So I just noticed, by trying to use this in #32779, that Block.iget does not do what I expected for Timedelta/DatetimeBlock (it returns a numpy array instead of an extension array).

@jbrockmendel do you know if there is an existing API to get one column of a 2D Timedelta/DatetimeBlock as an EA? array_values only works assuming you have a 2D block. I could in principle add an option to Block.iget, but that is also not really clean.

@jbrockmendel
Member

do you know if there is an existing API to get one column of a 2D Timedelta/DatetimeBlock as an EA?

There is not. We already override DatetimeLikeBlockMixin.iget; what happens if we change that to:

def iget(self, key):
    # TODO(EA2D): the reshape will be unnecessary with 2D EAs
    return self.array_values().reshape(self.shape)[key]
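The reshape-then-index idea in the suggestion above can be illustrated with a plain ndarray standing in for the 1D array that backs a conceptually 2D datetime-like block (a sketch of the mechanism, not the pandas code):

```python
import numpy as np

values = np.arange(6)            # 1D storage behind a conceptually 2D block
block2d = values.reshape(2, 3)   # view it with the block's 2D shape
row0 = block2d[0]                # indexing then recovers one 1D slice

assert row0.shape == (3,)
assert (row0 == np.array([0, 1, 2])).all()
```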

@jorisvandenbossche
Member Author

I changed iget earlier today to return TimedeltaArray, and there are quite a few test failures then. I didn't really investigate, but at least we have some code assuming that iget returns an ndarray. I think many errors came from _simple_new asserting that values is an ndarray.

@jbrockmendel
Member

jbrockmendel commented Apr 6, 2020

I changed iget earlier today to return TimedeltaArray, and there are quite a few test failures then. I didn't really investigate, but at least we have some code assuming that iget returns an ndarray. I think many errors came from _simple_new asserting that values is an ndarray.

It tentatively looks like this is coming from an issue with DTA.__getitem__ that may be fixed by #33290.

Update: #33290 doesn't quite do it, but a related DTA.__getitem__ bugfix (that I'll push after 33290 goes through) does.

@jreback
Contributor

jreback commented Apr 6, 2020

looks like this needs a rebase

@jreback jreback merged commit a2cdd50 into pandas-dev:master Apr 10, 2020
Contributor

jreback commented Apr 10, 2020

thanks @jorisvandenbossche
