BUG: preserve EA dtype in transpose #30091

Merged
merged 20 commits into from
Dec 27, 2019
Changes from 15 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Expand Up @@ -841,6 +841,7 @@ Reshaping
- Bug where :meth:`DataFrame.equals` returned True incorrectly in some cases when two DataFrames had the same columns in different orders (:issue:`28839`)
- Bug in :meth:`DataFrame.replace` that caused non-numeric replacer's dtype not respected (:issue:`26632`)
- Bug in :func:`melt` where supplying mixed strings and numeric values for ``id_vars`` or ``value_vars`` would incorrectly raise a ``ValueError`` (:issue:`29718`)
- Dtypes are now preserved when transposing a ``DataFrame`` where each column is the same extension dtype (:issue:`30091`)
- Bug in :func:`merge_asof` merging on a tz-aware ``left_index`` and ``right_on`` a tz-aware column (:issue:`29864`)
-
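
A minimal sketch of the behavior the new whatsnew entry describes, not taken from the PR itself. The column names and the choice of a tz-aware datetime dtype are assumptions made for illustration; any DataFrame whose columns all share one extension dtype should behave the same way on a pandas version that includes this change.

```python
import pandas as pd

# Two columns sharing one extension dtype (tz-aware datetimes).
ser = pd.Series(pd.date_range("2016-04-05 04:30", periods=3, tz="UTC"))
df = pd.DataFrame({"a": ser, "b": ser})

transposed = df.T

# With the fix, every column of the result keeps the extension dtype
# instead of falling back to object.
print(transposed.dtypes)
dtype_preserved = bool((transposed.dtypes == ser.dtype).all())
print(dtype_preserved)
```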

Expand Down
37 changes: 32 additions & 5 deletions pandas/core/frame.py
Expand Up @@ -2493,7 +2493,7 @@ def memory_usage(self, index=True, deep=False):
)
return result

def transpose(self, *args, **kwargs):
def transpose(self, *args, copy: bool = False):
"""
Transpose index and columns.

Expand All @@ -2503,9 +2503,14 @@ def transpose(self, *args, **kwargs):

Parameters
----------
*args, **kwargs
Additional arguments and keywords have no effect but might be
accepted for compatibility with numpy.
*args : tuple, optional
Contributor

don't we still need kwargs for this?

Contributor Author

AFAICT, no. Neither np.transpose nor ndarray.transpose take additional keyword arguments.
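
The claim in this reply can be checked with a short sketch like the following. It is an illustration only: `copy` is an arbitrary example of an unsupported keyword, `rejects_keyword` is a hypothetical helper, and `np.transpose` does accept its documented `axes` argument by name, so the statement is about keywords beyond that.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Positional axes are accepted by both spellings.
assert arr.transpose(1, 0).shape == (3, 2)
assert np.transpose(arr, (1, 0)).shape == (3, 2)


def rejects_keyword(func):
    """Return True if calling func raises TypeError (hypothetical helper)."""
    try:
        func()
    except TypeError:
        return True
    return False


# ``copy`` is used here only as an example of an unsupported keyword.
method_rejects = rejects_keyword(lambda: arr.transpose(copy=True))
function_rejects = rejects_keyword(lambda: np.transpose(arr, copy=True))
print(method_rejects, function_rejects)
```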

Accepted for compatibility with NumPy.
copy : bool, default False
Whether to copy the data after transposing, even for DataFrames
with a single dtype.

Note that a copy is always required for mixed dtype DataFrames,
or for DataFrames with any extension types.

Returns
-------
Expand Down Expand Up @@ -2586,7 +2591,29 @@ def transpose(self, *args, **kwargs):
dtype: object
"""
nv.validate_transpose(args, dict())
return super().transpose(1, 0, **kwargs)
# construct the args
Contributor

don't you think this is better located in pandas/core/reshape ? (and called as a helper function here)

Contributor Author

Maybe a slight preference for keeping it here just for readability. The reshape part is essentially just a list comprehension.

values = self.values
new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]

which I don't think warrants its own function. I don't see any places in /core/reshape.py that could use this. I believe those are reshaping to / from 1-D things. This is 2D -> 2D.

But happy to move it if you want / if you see other places that could use parts of this.
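
For readers who want to try the row-wise reconstruction outside of pandas internals, here is a rough sketch using only public API. It swaps the private `arr_type._from_sequence` call for `pd.array`, which is an assumption made for the example rather than what the PR does; the `Int64` data and column names are also illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.array([1, 2, None], dtype="Int64"),
        "B": pd.array([4, None, 6], dtype="Int64"),
    }
)

dtype = df.dtypes.iloc[0]  # the shared extension dtype (Int64 here)
values = df.to_numpy(dtype=object)  # 2-D object ndarray, one row per index label

# One extension array per original row; each becomes a column of the result.
new_values = [pd.array(row, dtype=dtype) for row in values]
result = pd.DataFrame(dict(zip(df.index, new_values)), index=df.columns)
print(result.dtypes)
```

The intermediate object ndarray is the cast discussed in this thread: the data leaves the extension array, is sliced by row, and is rebuilt into extension arrays afterwards.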


dtypes = list(self.dtypes)
if self._is_homogeneous_type and dtypes and is_extension_array_dtype(dtypes[0]):
# We have EAs with the same dtype. We can preserve that dtype in transpose.
dtype = dtypes[0]
arr_type = dtype.construct_array_type()
values = self.values
Member

IIRC the single-column case could be done without this casting. Think it's worth special-casing?

Contributor Author

I don't think it can be done without casting in general. We'll still need to reshape the (N, 1) to (1, N).

Member

I'm pretty sure I did this in the other PR, something like:

values = self._data.blocks[0].values
new_vals = [values[[n]] for n in range(len(values))]

(of course, if we had 2D EAs...)

Contributor Author

So, we can avoid casting to an ndarray by making N * P length-1 __getitem__ calls, which makes N * P extension arrays, which are concatenated into N final EAs.

My prior expectation is that converting to an ndarray and doing __getitem__ on that will be faster, and should have roughly the same amount of memory usage.

Member

what is P here? in the end this probably isn't worth bikeshedding (except to add to the pile of "reasons why EAs should support 2D")

Contributor Author

Number of columns.

Member

right, i was specifically referring to single-column

Member

I'm OK with not-doing this optimization here, just want to make sure we're on the same page about what the available optimization is
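
The single-column optimization discussed in this thread might look roughly like the sketch below. It uses the public `Series.array` accessor instead of the private block access in the reviewer's snippet, and the `Int64` column is an assumed example, so treat it as an approximation of the idea rather than the code from the other PR.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.array([1, None, 3], dtype="Int64")})

arr = df["A"].array  # the 1-D extension array backing the only column

# Indexing with a length-1 list returns a length-1 array of the same dtype,
# so nothing is cast to an object ndarray along the way.
pieces = [arr[[n]] for n in range(len(arr))]
result = pd.DataFrame(dict(zip(df.index, pieces)), index=df.columns)
print(result)
```

With P columns the same approach would need N * P such slices plus a concatenation per row, which is the cost the author weighs against a single conversion to an ndarray.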


new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
result = self._constructor(
dict(zip(self.index, new_values)), index=self.columns
)

else:
new_values = self.values.T
if copy:
new_values = new_values.copy()
Member

if non-homogeneous, then new_values above is already a copy, can avoid re-copying here

Contributor Author

Are we 100% sure about that? Or are there distinct dtypes that .values can combine without a copy?

Member

if non-homogeneous then we have multiple blocks, so multiple ndarrays that are going through np.c_ or something like it, right? AFAIK that has to allocate new data for the output. are there corner cases we're missing, @shoyer?
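
One way to probe the question for a specific case is `np.shares_memory`. The sketch below checks a single int64 + float64 frame (an assumed example); it does not settle whether some other combination of dtypes could be combined without a copy.

```python
import numpy as np
import pandas as pd

# int64 and float64 columns, so the frame is not homogeneous.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

combined = mixed.to_numpy()  # upcast to a single float64 2-D array
shares_a = np.shares_memory(combined, mixed["a"].to_numpy())
shares_b = np.shares_memory(combined, mixed["b"].to_numpy())
print(combined.dtype, shares_a, shares_b)
```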

result = self._constructor(
new_values, index=self.columns, columns=self.index
)

return result.__finalize__(self)

T = property(transpose)

Expand Down
44 changes: 0 additions & 44 deletions pandas/core/generic.py
Expand Up @@ -644,50 +644,6 @@ def _set_axis(self, axis, labels):
self._data.set_axis(axis, labels)
self._clear_item_cache()

def transpose(self, *args, **kwargs):
Contributor

does Series have transpose for compat?

Contributor Author

Yes, via IndexOpsMixin.
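
A quick sketch of what that compatibility method amounts to from the user's side: transposing a one-dimensional Series leaves it unchanged. The Series contents are an arbitrary example.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="x")

# Transposing a 1-D object is a no-op, so the result equals the input.
same_values = ser.T.equals(ser)
same_shape = ser.transpose().shape == ser.shape
print(same_values, same_shape)
```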

"""
Permute the dimensions of the %(klass)s

Parameters
----------
args : %(args_transpose)s
copy : bool, default False
Make a copy of the underlying data. Mixed-dtype data will
always result in a copy
**kwargs
Additional keyword arguments will be passed to the function.

Returns
-------
y : same as input

Examples
--------
>>> p.transpose(2, 0, 1)
>>> p.transpose(2, 0, 1, copy=True)
"""

# construct the args
axes, kwargs = self._construct_axes_from_arguments(
args, kwargs, require_all=True
)
axes_names = tuple(self._get_axis_name(axes[a]) for a in self._AXIS_ORDERS)
axes_numbers = tuple(self._get_axis_number(axes[a]) for a in self._AXIS_ORDERS)

# we must have unique axes
if len(axes) != len(set(axes)):
raise ValueError(f"Must specify {self._AXIS_LEN} unique axes")

new_axes = self._construct_axes_dict_from(
self, [self._get_axis(x) for x in axes_names]
)
new_values = self.values.transpose(axes_numbers)
if kwargs.pop("copy", None) or (len(args) and args[-1]):
new_values = new_values.copy()

nv.validate_transpose(tuple(), kwargs)
return self._constructor(new_values, **new_axes).__finalize__(self)

def swapaxes(self, axis1, axis2, copy=True):
"""
Interchange axes and swap values axes appropriately.
Expand Down
19 changes: 0 additions & 19 deletions pandas/tests/arithmetic/conftest.py
Expand Up @@ -235,25 +235,6 @@ def box_df_fail(request):
return request.param


@pytest.fixture(
params=[
(pd.Index, False),
(pd.Series, False),
(pd.DataFrame, False),
pytest.param((pd.DataFrame, True), marks=pytest.mark.xfail),
(tm.to_array, False),
],
ids=id_func,
)
def box_transpose_fail(request):
"""
Fixture similar to `box` but testing both transpose cases for DataFrame,
with the tranpose=True case xfailed.
"""
# GH#23620
return request.param


@pytest.fixture(params=[pd.Index, pd.Series, pd.DataFrame, tm.to_array], ids=id_func)
def box_with_array(request):
"""
Expand Down
24 changes: 11 additions & 13 deletions pandas/tests/arithmetic/test_period.py
Expand Up @@ -755,18 +755,18 @@ def test_pi_sub_isub_offset(self):
rng -= pd.offsets.MonthEnd(5)
tm.assert_index_equal(rng, expected)

def test_pi_add_offset_n_gt1(self, box_transpose_fail):
@pytest.mark.parametrize("transpose", [True, False])
def test_pi_add_offset_n_gt1(self, box_with_array, transpose):
Member

not a blocker, but transpose param is sub-optimal. For DataFrame case it will correctly test both cases, but for EA/Index/Series it will mean duplicate tests. I'll try to come up with something nicer in an upcoming arithmetic-test-specific pass

# GH#23215
# add offset to PeriodIndex with freq.n > 1
box, transpose = box_transpose_fail

per = pd.Period("2016-01", freq="2M")
pi = pd.PeriodIndex([per])

expected = pd.PeriodIndex(["2016-03"], freq="2M")

pi = tm.box_expected(pi, box, transpose=transpose)
expected = tm.box_expected(expected, box, transpose=transpose)
pi = tm.box_expected(pi, box_with_array, transpose=transpose)
expected = tm.box_expected(expected, box_with_array, transpose=transpose)

result = pi + per.freq
tm.assert_equal(result, expected)
Expand Down Expand Up @@ -984,16 +984,15 @@ def test_pi_add_sub_timedeltalike_freq_mismatch_monthly(self, mismatched_freq):
with pytest.raises(IncompatibleFrequency, match=msg):
rng -= other

def test_parr_add_sub_td64_nat(self, box_transpose_fail):
@pytest.mark.parametrize("transpose", [True, False])
def test_parr_add_sub_td64_nat(self, box_with_array, transpose):
# GH#23320 special handling for timedelta64("NaT")
box, transpose = box_transpose_fail

pi = pd.period_range("1994-04-01", periods=9, freq="19D")
other = np.timedelta64("NaT")
expected = pd.PeriodIndex(["NaT"] * 9, freq="19D")

obj = tm.box_expected(pi, box, transpose=transpose)
expected = tm.box_expected(expected, box, transpose=transpose)
obj = tm.box_expected(pi, box_with_array, transpose=transpose)
expected = tm.box_expected(expected, box_with_array, transpose=transpose)

result = obj + other
tm.assert_equal(result, expected)
Expand All @@ -1011,16 +1010,15 @@ def test_parr_add_sub_td64_nat(self, box_transpose_fail):
TimedeltaArray._from_sequence(["NaT"] * 9),
],
)
def test_parr_add_sub_tdt64_nat_array(self, box_df_fail, other):
def test_parr_add_sub_tdt64_nat_array(self, box_with_array, other):
# FIXME: DataFrame fails because when when operating column-wise
# timedelta64 entries become NaT and are treated like datetimes
box = box_df_fail

pi = pd.period_range("1994-04-01", periods=9, freq="19D")
expected = pd.PeriodIndex(["NaT"] * 9, freq="19D")

obj = tm.box_expected(pi, box)
expected = tm.box_expected(expected, box)
obj = tm.box_expected(pi, box_with_array)
expected = tm.box_expected(expected, box_with_array)

result = obj + other
tm.assert_equal(result, expected)
Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/extension/base/reshaping.py
Expand Up @@ -295,3 +295,19 @@ def test_ravel(self, data):
# Check that we have a view, not a copy
result[0] = result[1]
assert data[0] == data[1]

def test_transpose(self, data):
df = pd.DataFrame({"A": data[:4], "B": data[:4]}, index=["a", "b", "c", "d"])
result = df.T
expected = pd.DataFrame(
{
"a": type(data)._from_sequence([data[0]] * 2, dtype=data.dtype),
"b": type(data)._from_sequence([data[1]] * 2, dtype=data.dtype),
"c": type(data)._from_sequence([data[2]] * 2, dtype=data.dtype),
"d": type(data)._from_sequence([data[3]] * 2, dtype=data.dtype),
},
index=["A", "B"],
)
self.assert_frame_equal(result, expected)
self.assert_frame_equal(np.transpose(np.transpose(df)), df)
self.assert_frame_equal(np.transpose(np.transpose(df[["A"]])), df[["A"]])
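
The round-trip checks at the end of `test_transpose` can be reproduced outside the extension test suite with a sketch like this one. It assumes a nullable `Int64` array as the extension data and uses the public `pd.testing.assert_frame_equal` in place of the test class's `self.assert_frame_equal`.

```python
import numpy as np
import pandas as pd

data = pd.array([1, 2, None, 4], dtype="Int64")
df = pd.DataFrame({"A": data, "B": data}, index=["a", "b", "c", "d"])

# np.transpose dispatches to DataFrame.transpose, so transposing twice
# should give back the original frame, dtype included.
round_trip = np.transpose(np.transpose(df))
pd.testing.assert_frame_equal(round_trip, df)

# The single-column frame goes through the same round trip.
single = df[["A"]]
single_round_trip = np.transpose(np.transpose(single))
pd.testing.assert_frame_equal(single_round_trip, single)
```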
4 changes: 4 additions & 0 deletions pandas/tests/extension/json/test_json.py
Expand Up @@ -162,6 +162,10 @@ def test_unstack(self, data, index):
# this matches otherwise
return super().test_unstack(data, index)

@pytest.mark.skip(reason="Inconsistent sizes.")
def test_transpose(self, data):
super().test_transpose(data)


class TestGetitem(BaseJSON, base.BaseGetitemTests):
pass
Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/extension/test_numpy.py
Expand Up @@ -330,6 +330,10 @@ def test_merge_on_extension_array_duplicates(self, data):
# Fails creating expected
super().test_merge_on_extension_array_duplicates(data)

@skip_nested
def test_transpose(self, data):
super().test_transpose(data)


class TestSetitem(BaseNumPyTests, base.BaseSetitemTests):
@skip_nested
Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/frame/test_operators.py
Expand Up @@ -891,6 +891,22 @@ def test_no_warning(self, all_arithmetic_operators):


class TestTranspose:
@pytest.mark.parametrize(
"ser",
[
pd.date_range("2016-04-05 04:30", periods=3, tz="UTC"),
pd.period_range("1994", freq="A", periods=3),
pd.period_range("1969", freq="9s", periods=1),
pd.date_range("2016-04-05 04:30", periods=3).astype("category"),
pd.date_range("2016-04-05 04:30", periods=3, tz="UTC").astype("category"),
],
)
def test_transpose_retains_extension_dtype(self, ser):
# case with more than 1 column, must have same dtype
df = pd.DataFrame({"a": ser, "b": ser})
result = df.T
assert (result.dtypes == ser.dtype).all()

Member

round_trip = df.T.T
tm.assert_frame_equal(df, round_trip)

?

Contributor Author

Interestingly, the pd.date_range("2016-04-05 04:30", periods=3).astype("category") case fails that test. All the values are NaT.

I've xfailed it for now, and likely won't have time to look into it.

Member

sounds good, thanks

def test_transpose_tzaware_1col_single_tz(self):
# GH#26825
dti = pd.date_range("2016-04-05 04:30", periods=3, tz="UTC")
Expand Down