BUG: preserve EA dtype in transpose #30091

TomAugspurger · 2019-12-05T20:21:21Z

No description provided.

jbrockmendel · 2019-12-05T20:35:29Z

does this subsume #28048?

jbrockmendel · 2019-12-05T20:36:31Z

doc/source/whatsnew/v1.0.0.rst

@@ -775,6 +775,7 @@ Reshaping
 - Bug where :meth:`DataFrame.equals` returned True incorrectly in some cases when two DataFrames had the same columns in different orders (:issue:`28839`)
 - Bug in :meth:`DataFrame.replace` that caused non-numeric replacer's dtype not respected (:issue:`26632`)
 - Bug in :func:`melt` where supplying mixed strings and numeric values for ``id_vars`` or ``value_vars`` would incorrectly raise a ``ValueError`` (:issue:`29718`)
+- Dtypes are now preserved when transposing a ``DataFrame`` where each column is the same extension dtyep (:issue:``)


dtyep -> dtype

TomAugspurger · 2019-12-05T21:33:56Z

> does this subsume #28048?

Ah, yeah forgot about that one.

jbrockmendel · 2019-12-06T00:14:31Z

Easy to forget about, I closed it to clear the queue and then only reopened it a couple days ago. I'm going to re-close it in favor of this one; maybe it's worth salvaging some of the tests it implemented

TomAugspurger · 2019-12-06T13:32:35Z

pandas/core/generic.py

+
+            # Slow, but unavoidable with 1D EAs.
+            new_values = []
+            for i in range(len(self)):


I'm rethinking this approach. This results in n_rows * n_columns __getitem__s. My intent was to avoid going through a 2D object-dtype ndarray. But we're essentially doing that with lists. So I think it'll be better to just do .values.T and then rebuild the EAs from the object-dtype array.
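The trade-off described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: both function names are hypothetical, and the nullable `Int64` dtype stands in for any 1D extension array. The first function makes one `__getitem__` call per cell; the second converts once to a 2D object-dtype ndarray and rebuilds one extension array per original row.

```python
import pandas as pd


def rows_via_getitem(df):
    """Hypothetical: build one EA per row using n_rows * n_columns __getitem__ calls."""
    dtype = df.dtypes.iloc[0]
    arr_type = dtype.construct_array_type()
    columns = [df.iloc[:, j].array for j in range(df.shape[1])]
    rows = []
    for i in range(len(df)):
        # One scalar lookup per (row, column) pair.
        rows.append(arr_type._from_sequence([col[i] for col in columns], dtype=dtype))
    return rows


def rows_via_object_array(df):
    """Hypothetical: convert once to a 2D object ndarray, then rebuild the EAs."""
    dtype = df.dtypes.iloc[0]
    arr_type = dtype.construct_array_type()
    values = df.to_numpy(dtype=object)  # shape (n_rows, n_columns)
    return [arr_type._from_sequence(row, dtype=dtype) for row in values]


df = pd.DataFrame(
    {
        "a": pd.array([1, 2, None], dtype="Int64"),
        "b": pd.array([4, None, 6], dtype="Int64"),
    }
)
slow = rows_via_getitem(df)
fast = rows_via_object_array(df)
```

Each element of `slow` and `fast` is a length-2 `Int64` array that would become one column of the transposed frame.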

jreback · 2019-12-06T14:42:53Z

pandas/core/generic.py

+            kwargs.pop("copy", None)  # by definition, we're copying
+            dtype = self._data.blocks[0].dtype
+            arr_type = dtype.construct_array_type()
+


I would move this logic to pandas/core/reshape/reshape.py; it has a lot of similarity to _unstack_extension_series

jbrockmendel · 2019-12-06T17:16:35Z

pandas/core/generic.py

+        if (
+            self._is_homogeneous_type
+            and len(self._data.blocks)
+            and is_extension_array_dtype(self._data.blocks[0].dtype)


we can avoid self._data references by making this len(self.dtypes) and is_extension_array_dtype(self.dtypes.iloc[0])
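For readers outside the codebase, the suggested check can be approximated with public API only. This is a sketch under assumptions: the helper name is hypothetical, and `dtypes.nunique() == 1` is used here as a stand-in for the private `_is_homogeneous_type` property.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def is_homogeneous_extension_frame(df):
    """Hypothetical helper: True if every column shares a single extension dtype."""
    dtypes = df.dtypes
    if len(dtypes) == 0:
        # No columns, so there is no first dtype to inspect.
        return False
    return dtypes.nunique() == 1 and is_extension_array_dtype(dtypes.iloc[0])


ea_frame = pd.DataFrame(
    {"a": pd.array([1, 2], dtype="Int64"), "b": pd.array([3, None], dtype="Int64")}
)
mixed_frame = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"), "b": [1.5, 2.5]})
numpy_frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
```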

ditto on 731 with self.dtypes

TomAugspurger · 2019-12-06T18:11:38Z

Pushed a largeish refactor. We actually don't need NDFrame.transpose anymore: Series gets its transpose from IndexOpsMixin. So I moved the logic to DataFrame.transpose, and was able to remove all the axes handling stuff.

> maybe it's worth salvaging some of the tests it implemented

Added.

jbrockmendel · 2019-12-06T18:16:41Z

pandas/core/frame.py

+        if self._is_homogeneous_type and is_extension_array_dtype(self.iloc[:, 0]):
+            dtype = self.dtypes.iloc[0]
+            arr_type = dtype.construct_array_type()
+            values = self.values


IIRC the single-column case could be done without this casting. Think it's worth special-casing?

I don't think it can be done without casting in general. We'll still need to reshape the (N, 1) to (1, N).

I'm pretty sure I did this in the other PR, something like:

values = self._data.blocks[0].values
new_vals = [values[[n]] for n in range(len(values))]

(of course, if we had 2D EAs...)

So, we can avoid casting to an ndarray by making N * P length-1 __getitem__ calls, which makes N * P extension arrays, which are concatenated into N final EAs.

My prior expectation is that converting to an ndarray and doing __getitem__ on that will be faster, and should have roughly the same amount of memory usage.

what is P here? in the end this probably isn't worth bikeshedding (except to add to the pile of "reasons why EAs should support 2D")

Number of columns.

right, i was specifically referring to single-column

I'm OK with not-doing this optimization here, just want to make sure we're on the same page about what the available optimization is
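The single-column optimization discussed in this thread might look roughly like the sketch below. It is not part of the PR: the function name is hypothetical, and it goes through the public `.array` accessor rather than `self._data.blocks[0].values`. Each length-1 `__getitem__` returns an extension array of the same dtype, so no object-dtype ndarray is created.

```python
import pandas as pd


def transpose_single_column(df):
    """Hypothetical: transpose a one-column EA frame without casting to object."""
    values = df.iloc[:, 0].array
    # A list indexer keeps the result an extension array of length 1.
    pieces = [values[[n]] for n in range(len(values))]
    result = pd.DataFrame(dict(enumerate(pieces)), index=df.columns)
    result.columns = df.index
    return result


df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})
out = transpose_single_column(df)
```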

jreback

looks good

jreback · 2019-12-10T13:41:57Z

pandas/core/frame.py

-        *args, **kwargs
-            Additional arguments and keywords have no effect but might be
-            accepted for compatibility with numpy.
+        *args : tuple, optional


don't we still need kwargs for this?

AFAICT, no. Neither np.transpose nor ndarray.transpose take additional keyword arguments.
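A quick way to check this claim locally is sketched below. The keyword `not_a_real_option` is deliberately made up, and the helper name is hypothetical; the point is only that axes are accepted while an unrecognized keyword is rejected with `TypeError`.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Axes are accepted by both spellings.
function_shape = np.transpose(arr, (1, 0)).shape
method_shape = arr.transpose(1, 0).shape


def rejects_extra_keyword(func):
    """Return True if calling func with a made-up keyword raises TypeError."""
    try:
        func(not_a_real_option=True)
    except TypeError:
        return True
    return False


method_rejects = rejects_extra_keyword(lambda **kw: arr.transpose(1, 0, **kw))
function_rejects = rejects_extra_keyword(lambda **kw: np.transpose(arr, (1, 0), **kw))
```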

jreback · 2019-12-10T13:42:51Z

pandas/core/generic.py

@@ -644,50 +644,6 @@ def _set_axis(self, axis, labels):
        self._data.set_axis(axis, labels)
        self._clear_item_cache()

-    def transpose(self, *args, **kwargs):


does Series have transpose for compat?

Yes, via IndexOpsMixin.
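As a small illustration of the compat behaviour mentioned here: transposing a one-dimensional Series gives back the same data, since there is only one axis to permute.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="x")

# Both spellings exist on Series for NumPy compatibility.
via_property = ser.T
via_method = ser.transpose()
```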

jreback · 2019-12-10T13:44:27Z

pandas/core/frame.py

@@ -2587,7 +2592,28 @@ def transpose(self, *args, **kwargs):
        dtype: object
        """
        nv.validate_transpose(args, dict())
-        return super().transpose(1, 0, **kwargs)
+        # construct the args


don't you think this is better located in pandas/core/reshape ? (and called as a helper function here)

Maybe a slight preference for keeping it here just for readability. The reshape part is essentially just a list comprehension.

values = self.values
new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]

which I don't think warrants its own function. I don't see any places in /core/reshape.py that could use this. I believe those are reshaping to / from 1-D things. This is 2D -> 2D.

But happy to move it if you want / if you see other places that could use parts of this.
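For context, the list comprehension quoted above could be wrapped into a standalone function along these lines. This is a sketch, not the merged implementation: the function name is hypothetical, it assumes every column shares one extension dtype, and it uses `DataFrame.to_numpy(dtype=object)` in place of `self.values`.

```python
import pandas as pd


def transpose_homogeneous_ea(df):
    """Hypothetical: transpose a frame whose columns all share one extension dtype."""
    dtype = df.dtypes.iloc[0]
    arr_type = dtype.construct_array_type()
    values = df.to_numpy(dtype=object)
    # Each row of the original becomes one column (one EA) of the result.
    new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
    result = pd.DataFrame(dict(enumerate(new_values)), index=df.columns)
    result.columns = df.index
    return result


df = pd.DataFrame(
    {
        "a": pd.array([1, 2, None], dtype="Int64"),
        "b": pd.array([4, None, 6], dtype="Int64"),
    },
    index=["x", "y", "z"],
)
transposed = transpose_homogeneous_ea(df)
```

Going through `dict(enumerate(...))` and assigning the columns afterwards is a choice made in this sketch so that duplicate index labels do not collide as dictionary keys.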

TomAugspurger · 2019-12-17T17:26:37Z

I think this is ready.

jbrockmendel · 2019-12-18T20:24:03Z

pandas/core/frame.py

+        else:
+            new_values = self.values.T
+            if copy:
+                new_values = new_values.copy()


if non-homogeneous, then new_values above is already a copy, can avoid re-copying here

Are we 100% sure about that? Or are there distinct dtypes that .values can combine without a copy?

if non-homogeneous then we have multiple blocks, so multiple ndarrays that are going through np.c_ or something like it, right? AFAIK that has to allocate new data for the output. Are there corner cases we're missing, @shoyer?
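One way to probe the common mixed-dtype case is sketched below. It only covers an int64 column next to a float64 column, so it does not settle the corner-case question raised above; it just shows that in this case `.values` returns a freshly allocated array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

# The int64 and float64 columns are upcast into one new float64 array.
values = df.values
shares_with_any_column = any(
    np.shares_memory(values, df[col].to_numpy()) for col in df.columns
)
```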

jbrockmendel · 2019-12-18T20:27:32Z

pandas/tests/arithmetic/test_period.py

@@ -755,18 +755,18 @@ def test_pi_sub_isub_offset(self):
        rng -= pd.offsets.MonthEnd(5)
        tm.assert_index_equal(rng, expected)

-    def test_pi_add_offset_n_gt1(self, box_transpose_fail):
+    @pytest.mark.parametrize("transpose", [True, False])
+    def test_pi_add_offset_n_gt1(self, box_with_array, transpose):


not a blocker, but transpose param is sub-optimal. For DataFrame case it will correctly test both cases, but for EA/Index/Series it will mean duplicate tests. I'll try to come up with something nicer in an upcoming arithmetic-test-specific pass

jbrockmendel · 2019-12-18T20:29:50Z

pandas/tests/frame/test_operators.py

+        df = pd.DataFrame({"a": ser, "b": ser})
+        result = df.T
+        assert (result.dtypes == ser.dtype).all()
+


round_trip = df.T.T
tm.assert_frame_equal(df, round_trip)

?

Interestingly, the pd.date_range("2016-04-05 04:30", periods=3).astype("category") case fails that test. All the values are NaT.

I've xfailed it for now, and likely won't have time to look into it.

sounds good, thanks
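A standalone version of the dtype check and the suggested round-trip check might look like this. It uses the nullable `Int64` dtype as an assumed example and the public `pd.testing.assert_frame_equal` rather than the internal `tm` alias; it does not cover the categorical-datetime case that was xfailed.

```python
import pandas as pd

ser = pd.array([1, 2, None], dtype="Int64")
df = pd.DataFrame({"a": ser, "b": ser})

# Every column of the transpose should keep the extension dtype.
result = df.T
dtypes_preserved = all(dt == ser.dtype for dt in result.dtypes)

# Transposing twice should give back the original frame.
round_trip = df.T.T
pd.testing.assert_frame_equal(df, round_trip)
```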

jbrockmendel · 2019-12-18T20:31:31Z

A handful of comments, generally looks good

jreback · 2019-12-26T23:11:31Z

@TomAugspurger merge master when you have a chance (or @jbrockmendel if this is a blocker)

jbrockmendel · 2019-12-27T02:22:10Z

rebased. not a blocker for the blockwise PRs

jreback · 2019-12-27T15:05:22Z

thanks @TomAugspurger

@jbrockmendel I don't think we had a dedicated issue for this to be closed......

jbrockmendel · 2019-12-27T16:37:40Z

#22120. Since it's about cyberpandas, I think we should ask Tom to double-check that this fixes it.

BUG: preserve EA dtype in transpose

a2217e0

TomAugspurger mentioned this pull request Dec 5, 2019

REF: implement cumulative ops block-wise #29872

Merged

5 tasks

jbrockmendel reviewed Dec 5, 2019

TomAugspurger added 2 commits December 5, 2019 15:32

fix typo

4fb44c5

remove xpass

e18a426

jbrockmendel mentioned this pull request Dec 6, 2019

BUG: retain extension dtypes in transpose #28048

Closed

5 tasks

TomAugspurger commented Dec 6, 2019

jreback requested changes Dec 6, 2019

jreback added the ExtensionArray and Reshaping labels Dec 6, 2019

jbrockmendel reviewed Dec 6, 2019

TomAugspurger added 5 commits December 6, 2019 11:44

simplify

9be80b7

Merge remote-tracking branch 'upstream/master' into ea-transpose

52d3b9c

simplify

bfdfccf

steal tests

10d81bd

update docs

132472d

jbrockmendel reviewed Dec 6, 2019

TomAugspurger added 4 commits December 6, 2019 13:17

fixup

6aae8d6

Merge remote-tracking branch 'upstream/master' into ea-transpose

5448bbb

empty

9d703a2

Merge remote-tracking branch 'upstream/master' into ea-transpose

8b8a464

jreback requested changes Dec 10, 2019

jbrockmendel reviewed Dec 10, 2019

jbrockmendel reviewed Dec 10, 2019

jbrockmendel reviewed Dec 10, 2019

Merge remote-tracking branch 'upstream/master' into ea-transpose

79550bd

TomAugspurger added 2 commits December 12, 2019 08:02

box

feecee8

Merge remote-tracking branch 'upstream/master' into ea-transpose

dada1a6

jbrockmendel reviewed Dec 18, 2019

jbrockmendel reviewed Dec 18, 2019

jbrockmendel reviewed Dec 18, 2019

TomAugspurger added 3 commits December 20, 2019 06:40

Merge remote-tracking branch 'upstream/master' into ea-transpose

1e58c4e

update test.

9abd6c2

filter

f6b3c37

jreback added this to the 1.0 milestone Dec 26, 2019

jbrockmendel added 2 commits December 26, 2019 16:35

Merge branch 'master' of https://github.com/pandas-dev/pandas into ea-transpose

4675de6

fixup unused import

6d9daa8

jreback approved these changes Dec 27, 2019

jreback merged commit cb5f9d1 into pandas-dev:master Dec 27, 2019

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019

BUG: preserve EA dtype in transpose (pandas-dev#30091)

5054b34

TomAugspurger deleted the ea-transpose branch February 28, 2020 23:53

TomAugspurger mentioned this pull request Feb 28, 2020

Transposing dataframe loses dtype and ExtensionArray #22120

Closed

rhshadrach mentioned this pull request Jul 8, 2021

BUG: 1.3.0 DataFrame.agg over categorical columns with non-unique index returns wrong size result #42380

Closed

3 tasks

simonjayhawkins mentioned this pull request Jul 10, 2021

REGR: DataFrame.agg with axis=1, EA dtype, and duplicate index #42449

Merged

4 tasks
