
INT: provide helpers for accessing the values of DataFrame columns #33252

Merged (7 commits) on Apr 10, 2020

Conversation

Member

@jorisvandenbossche jorisvandenbossche commented Apr 3, 2020

Broken off from #32867, also mentioned this in #33229

In general, when doing certain things column-wise, we often just need the arrays (e.g. to calculate a reduction, or to check the dtype), and creating a Series gives unnecessary overhead in that case.

In this PR, I therefore added two helper functions to make this easier: _ixs_values, a variant of _ixs that returns the array instead of a Series, and _iter_arrays, which calls _ixs_values for each column iteratively.

I also used it in one place as an illustration (in _reduce, which I was working on in #32867), but it can of course be used more generally.
In that example case, it replaces an apply:

In [1]: df = pd.DataFrame(np.random.randint(1000, size=(10000,10))).astype("Int64").copy() 

In [2]: from pandas.core.dtypes.common import is_datetime64_any_dtype, is_period_dtype

In [3]: %timeit df.dtypes.apply(lambda x: is_datetime64_any_dtype(x) or is_period_dtype(x)) 
256 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit np.array([is_datetime64_any_dtype(values) or is_period_dtype(values) for values in df._iter_arrays()], dtype=bool) 
56.2 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(and this is only for 10 columns)
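The same trade-off can be sketched with public API only (`pandas.api.types.is_datetime64_any_dtype` is the public home of the dtype check; the private helper in this PR additionally removes the remaining Series construction). A minimal, hedged comparison sketch:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

df = pd.DataFrame(np.random.randint(1000, size=(100, 3))).astype("Int64")

# Series-based: df.dtypes materializes a Series, and apply builds
# a second Series with inference on top of it
via_apply = df.dtypes.apply(is_datetime64_any_dtype).to_numpy()

# comprehension-based: iterate the dtype objects directly and build
# the boolean ndarray in one step, with no intermediate Series
via_loop = np.array(
    [is_datetime64_any_dtype(dt) for dt in df.dtypes.to_numpy()], dtype=bool
)

assert (via_apply == via_loop).all()
assert not via_loop.any()  # Int64 columns are not datetime-like
```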

@jbrockmendel

Member

WillAyd commented Apr 3, 2020

Do you have ideas generally about places this is applicable? One potential spot could be in groupby code:

def _iterate_slices(self) -> Iterable[Series]:

Which right now will yield a 2D object when there are duplicate labels but maybe shouldn't

Wondering if we should include the label as part of what is returned here
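The 2D-slice behaviour mentioned above can be seen with plain column selection on duplicated labels (a minimal sketch of the symptom, not the groupby code itself):

```python
import pandas as pd

# two columns sharing the same label
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# with duplicated labels, selecting by label returns a 2D DataFrame
# rather than a 1D Series, which is what slice-iterating code
# has to guard against
sliced = df["a"]
assert isinstance(sliced, pd.DataFrame)
assert sliced.shape == (2, 2)
```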

Member Author

jorisvandenbossche commented Apr 3, 2020

Wondering if we should include the label as part of what is returned here

In principle you can still do that rather easily manually, with for label, values in zip(df.columns, df._iter_arrays()), if you also need the corresponding labels.
But yeah, it will depend a bit on where this would be used whether it makes sense to include this in the helper function or not.

The _iterate_slices you link to iterates through the columns as Series objects, while the function I am adding here specifically gives arrays. Now, I am not familiar enough with that specific groupby code to know if it actually needs the values as a Series.
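A rough public-API approximation of the zip pattern described above (note that `df[name].array` still constructs a Series first, which is exactly the overhead the private helper is meant to avoid):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# pair each column label with that column's backing array; the inner
# generator mimics _iter_arrays() using only public API
pairs = {
    name: arr
    for name, arr in zip(df.columns, (df[c].array for c in df.columns))
}

assert set(pairs) == {"a", "b"}
assert all(len(arr) == len(df) for arr in pairs.values())
```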

@jbrockmendel
Member

_iterate_slices

I've got a branch motivated by trying to optimize this; I ended up making an NDFrame._from_mgr constructor for when we are passing just a BlockManager/SingleBlockManager to DataFrame/Series. Got a good deal of mileage out of that.

@jorisvandenbossche do you have a good read on what part of the Series construction is causing the overhead?

is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype)
for values in self._iter_arrays()
],
dtype=bool,
Member

the perf issue here is in the self.dtypes call?

Member

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Member Author

the perf issue here is in the self.dtypes call?

Both the self.dtypes call that creates a Series, and the apply that creates another Series with inference.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Indeed, that would be even more appropriate here.

Now, I am happy to change it to that, but that's not really the core of this PR. I added one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well.

Member Author

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

For a DataFrame with only extension blocks, it's actually slower than [v.dtype for v in df._iter_arrays()] (for a DataFrame with a 2D block it will of course be faster)

Member

since we're just going to call .any() on this anyway, could do the any up-front on the iterator, no big deal

@jorisvandenbossche
Member Author

do you have a good read on what part of the Series construction is causing the overhead?

I didn't check again this time (I looked at it previously), but here, when the only thing you need is the array, any overhead from the Series constructor is easily avoided. It's not that this PR adds complex code. I think it will be a regularly useful helper function (I added one use case here, and the two linked PRs can be two other use cases already).

I suppose that the Series constructor overhead can be decreased (which would certainly be useful anyway), but it might require some additional keywords (e.g. a fastpath that avoids sanitize_array) or a separate private constructor like _from_array (similar to DataFrame._from_arrays).
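The constructor overhead being discussed is easy to observe in isolation; a minimal measurement sketch (absolute numbers will vary by machine and pandas version):

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(1000)

# summing through a freshly constructed Series pays for index creation,
# block wrapping, and dtype checks on every call; the raw ndarray does not
t_series = timeit.timeit(lambda: pd.Series(arr).sum(), number=200)
t_array = timeit.timeit(lambda: arr.sum(), number=200)

# both paths compute the same value
assert pd.Series(arr).sum() == arr.sum() == 499500
```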

Contributor

@jreback jreback left a comment

looks ok, are there some tests for this specifically?

@jreback added the Indexing (related to indexing on series/frames, not to indexes themselves) and Performance (memory or execution speed performance) labels on Apr 3, 2020
@jorisvandenbossche
Member Author

I don't think tests are necessarily needed, since it's an internal helper function (most private methods don't have specific tests), but I can add a simple one if needed.

"""
return self._data.iget_values(i)

def _iter_arrays(self):
Contributor

shouldn't this be typed as a Generator?
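For context on the annotation question: typing.Generator is only needed when the send and return types matter; a generator function that only yields is conventionally annotated with the simpler Iterator (a general Python typing note, not pandas-specific):

```python
from typing import Generator, Iterator

def gen_simple(n: int) -> Iterator[int]:
    # only yields, so Iterator[int] is the idiomatic annotation
    for i in range(n):
        yield i

def gen_full(n: int) -> Generator[int, None, None]:
    # Generator[Yield, Send, Return] spells out the unused slots explicitly
    for i in range(n):
        yield i

assert list(gen_simple(3)) == list(gen_full(3)) == [0, 1, 2]
```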


@@ -2568,6 +2568,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Contributor

let's name this _iget_values for consistency no? (ixs sounds like cross-section, e.g. rows which this is not)

Member Author

There is no _get_values on DataFrame, so for DataFrame that gives no consistency, though it is of course more consistent with the block method. Anyway, I don't really have a preference either way.

Member

"Union[np.ndarray, ExtensionArray]" should be "ArrayLike"

Member Author

@jbrockmendel do you have a preference for the name?

Contributor

jreback commented Apr 3, 2020

I don't think tests are necessarily needed, since it's an internal helper function (most private methods don't have specific tests), but I can add a simple one if needed.

ok no big deal

@@ -2568,6 +2569,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
"""
Get the values of the ith column (ndarray or ExtensionArray, as stored
Member

"ith" -> "i'th"

"""
return self._data.iget_values(i)

def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]:
Member

"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"

def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]:
"""
Iterate over the arrays of all columns in order.
This returns the values as stored in the Block (ndarray or ExtensionArray).
Member

i think another newline?

@@ -1001,6 +1002,14 @@ def iget(self, i: int) -> "SingleBlockManager":
self.axes[1],
)

def iget_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Member

ArrayLike

@jbrockmendel
Member

some comments on annotations, otherwise LGTM

@jorisvandenbossche
Member Author

"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"

Honestly, I am not yet much into the typing in pandas, but I find that a rather confusing name (in human terms, "array-like" means something less strict to me)
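For context, the point of the alias is simply to name the union once (a hypothetical sketch; ExtensionArray here is a stand-in class, and the real pandas._typing definition may differ in detail):

```python
from typing import Iterator, List, Union
import numpy as np

class ExtensionArray:  # stand-in for pandas' ExtensionArray base class
    pass

# one shared alias instead of spelling out the union in each signature
ArrayLike = Union[np.ndarray, ExtensionArray]

def iter_arrays(arrays: List[ArrayLike]) -> Iterator[ArrayLike]:
    # every internals helper can reuse the alias in its annotations
    yield from arrays

result = list(iter_arrays([np.arange(3), ExtensionArray()]))
assert isinstance(result[0], np.ndarray)
assert isinstance(result[1], ExtensionArray)
```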

@jbrockmendel
Member

Honestly, I am not yet much into the typing in pandas, but I find that a rather confusing name (as in human terms, array-like means something less strict to me)

I understand where you're coming from, but this is one of those "if you want to change the policy, that's its own discussion" things

do you have a preference for the name?

I'm fine with _ixs_values. _iter_arrays might be clearer as _iter_column_arrays, in which case it might make sense for _ixs_values to become _get_column_array

@jorisvandenbossche
Member Author

_iter_arrays might be clearer as _iter_column_arrays, in which case it might make sense for _ixs_values to become _get_column_array

+1, changed to that.

And updated the type annotations.

@jorisvandenbossche
Member Author

So I just noticed, by trying to use this in #32779, that Block.iget does not do what I expected for Timedelta/DatetimeBlock (it returns a numpy array instead of an extension array).

@jbrockmendel do you know if there is an existing API to get one column of a 2D Timedelta/DatetimeBlock as an EA? array_values only works assuming you have a 2D block. I could in principle add an option to Block.iget, but that is also not really clean.

@jbrockmendel
Member

do you know if there is an existing API to get one column of a 2D Timedelta/DatetimeBlock as an EA?

There is not. We already override DatetimeLikeBlockMixin.iget; what happens if we change that to:

def iget(self, key):
    # TODO(EA2D): the reshape will be unnecessary with 2D EAs
    return self.array_values().reshape(self.shape)[key]
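The reshape-then-index idea in the suggestion above can be illustrated with a plain ndarray standing in for the 1D array that backs a conceptually 2D datetime-like block (a sketch of the mechanism, not the pandas code):

```python
import numpy as np

values = np.arange(6)            # 1D storage behind a conceptually 2D block
block2d = values.reshape(2, 3)   # view it with the block's 2D shape
row0 = block2d[0]                # indexing then recovers one 1D slice

assert row0.shape == (3,)
assert (row0 == np.array([0, 1, 2])).all()
```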

@jorisvandenbossche
Member Author

I changed iget earlier today to return TimedeltaArray, and there are quite a few test failures then. I didn't really investigate, but at least we have some code assuming that iget returns an ndarray. I think many errors came from _simple_new asserting that values is an ndarray.

@jbrockmendel
Member

jbrockmendel commented Apr 6, 2020

I changed iget earlier today to return TimedeltaArray, and there are quite a few test failures then. I didn't really investigate, but at least we have some code assuming that iget returns an ndarray. I think many errors came from _simple_new asserting that values is an ndarray.

It tentatively looks like this is coming from an issue with DTA.__getitem__ that may be fixed by #33290.

Update: #33290 doesn't quite do it, but a related DTA.__getitem__ bugfix (that I'll push after 33290 goes through) does.

@jreback
Contributor

jreback commented Apr 6, 2020

looks like this needs a rebase

@jreback jreback merged commit a2cdd50 into pandas-dev:master Apr 10, 2020
Contributor

jreback commented Apr 10, 2020

thanks @jorisvandenbossche
