Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: preserve EA dtype in transpose #30091

Merged
merged 20 commits into from
Dec 27, 2019
Merged

Conversation

TomAugspurger
Copy link
Contributor

No description provided.

@jbrockmendel
Copy link
Member

does this subsume #28048?

@@ -775,6 +775,7 @@ Reshaping
- Bug where :meth:`DataFrame.equals` returned True incorrectly in some cases when two DataFrames had the same columns in different orders (:issue:`28839`)
- Bug in :meth:`DataFrame.replace` that caused non-numeric replacer's dtype not respected (:issue:`26632`)
- Bug in :func:`melt` where supplying mixed strings and numeric values for ``id_vars`` or ``value_vars`` would incorrectly raise a ``ValueError`` (:issue:`29718`)
- Dtypes are now preserved when transposing a ``DataFrame`` where each column is the same extension dtyep (:issue:``)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtyep -> dtype

@TomAugspurger
Copy link
Contributor Author

does this subsume #28048?

Ah, yeah forgot about that one.

@jbrockmendel
Copy link
Member

Easy to forget about, i closed it to clear the queue and then only reopened it a couple days ago. I'm going to re-close it in favor or this one; maybe its worth salvaging some of the tests it implemented


# Slow, but unavoidable with 1D EAs.
new_values = []
for i in range(len(self)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rethinking this approach. This results in n_rows * n_columns __getitem__s. My intent was to avoid going through a 2D object-dtype ndarray. But we're essentially doing that with lists. So I think it'll be better to just do .values.T and then rebuild the EAs from the object-dtype array.

kwargs.pop("copy", None) # by definition, we're copying
dtype = self._data.blocks[0].dtype
arr_type = dtype.construct_array_type()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this logic to pandas/core/reshape/reshape.py this has a lot of similiarity to _unstack_extension_series

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 6, 2019
if (
self._is_homogeneous_type
and len(self._data.blocks)
and is_extension_array_dtype(self._data.blocks[0].dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can avoid self._data references by making this len(self.dtypes) and is_extension_array_dtype(self.dtypes.iloc[0])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto on 731 with self.dtypes

@TomAugspurger
Copy link
Contributor Author

Pushed a largeish refactor. We actually don't need NDFrame.transpose anymore. Series gets its from IndexOpsMixin. So I moved the logic to DataFrame.transpose, and was able to remove all the axes handling stuff.

maybe its worth salvaging some of the tests it implemented

Added.

if self._is_homogeneous_type and is_extension_array_dtype(self.iloc[:, 0]):
dtype = self.dtypes.iloc[0]
arr_type = dtype.construct_array_type()
values = self.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the single-column case could be done without this casting. think its worth special-casing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can be done without casting in general. We'll still need to reshape the (N, 1) to (1, N).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im pretty sure i did this in the other PR, something like:

values = self._data.blocks[0].values
new_vals = [values[[n]] for n in range(len(values))]

(of course, if we had 2D EAs...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we can avoid casting to an ndarray by making N * P length-1 __getitem__ calls, which makes N * P extension arrays, which are concatenated into N final EAs.

My prior expectation is that converting to an ndarray and doing __getitem__ on that will be faster, and should have roughly the same amount of memory usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is P here? in the end this probably isn't worth bikeshedding (except to add to the pile of "reasons why EAs should support 2D")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, i was specifically referring to single-column

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with not-doing this optimization here, just want to make sure we're on the same page about what the available optimization is

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

*args, **kwargs
Additional arguments and keywords have no effect but might be
accepted for compatibility with numpy.
*args : tuple, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we still need kwargs for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, no. Neither np.transpose nor ndarray.transpose take additional keyword arguments.

pandas/core/frame.py Show resolved Hide resolved
@@ -644,50 +644,6 @@ def _set_axis(self, axis, labels):
self._data.set_axis(axis, labels)
self._clear_item_cache()

def transpose(self, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does Series have transpose for compat?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, via IndexOpsMixin.

pandas/core/frame.py Show resolved Hide resolved
@@ -2587,7 +2592,28 @@ def transpose(self, *args, **kwargs):
dtype: object
"""
nv.validate_transpose(args, dict())
return super().transpose(1, 0, **kwargs)
# construct the args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you think this is better located in pandas/core/reshape ? (and called as a helper function here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a slight preference for keeping it here just for readability. The reshape part is essentially just a list comprehension.

values = self.values
new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]

which I don't think warrants its own function. I don't see anything places in /core/reshape.py that could use this. I believe those are reshaping to / from 1-D things. This is 2D -> 2D.

But happy to move it if you want / if you see other places that could use parts of this.

@TomAugspurger
Copy link
Contributor Author

I think this is ready.

else:
new_values = self.values.T
if copy:
new_values = new_values.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if non-homogeneous, then new_values above is already a copy, can avoid re-copying here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we 100% sure about that? Or are there types distinct dtypes that .values can combine without copy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if non-homogeneous then we have multiple blocks, so multiple ndarrays that are going through np.c_ or something like it, right? AFAIK that has to allocate new data for the output. are there corner cases were missing @shoyer?

@@ -755,18 +755,18 @@ def test_pi_sub_isub_offset(self):
rng -= pd.offsets.MonthEnd(5)
tm.assert_index_equal(rng, expected)

def test_pi_add_offset_n_gt1(self, box_transpose_fail):
@pytest.mark.parametrize("transpose", [True, False])
def test_pi_add_offset_n_gt1(self, box_with_array, transpose):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker, but transpose param is sub-optimal. For DataFrame case it will correctly test both cases, but for EA/Index/Series it will mean duplicate tests. I'll try to come up with something nicer in an upcoming arithmetic-test-specific pass

df = pd.DataFrame({"a": ser, "b": ser})
result = df.T
assert (result.dtypes == ser.dtype).all()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

round_trip = df.T.T
tm.assert_frame_equal(df, round_trip)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, the pd.date_range("2016-04-05 04:30", periods=3).astype("category") case fails that test. All the values are NaT.

I've xfalied it for now, and likely won't have time to look into it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, thanks

@jbrockmendel
Copy link
Member

A handful of comments, generally looks good

@jreback jreback added this to the 1.0 milestone Dec 26, 2019
@jreback
Copy link
Contributor

jreback commented Dec 26, 2019

@TomAugspurger merge master when you have a chance (or @jbrockmendel if this is a blocker)

@jbrockmendel
Copy link
Member

rebased. not a blocker for the blockwise PRs

@jreback jreback merged commit cb5f9d1 into pandas-dev:master Dec 27, 2019
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

thanks @TomAugspurger

@jbrockmendel I don't think we had a dedicated issue for this to be closed......

@jbrockmendel
Copy link
Member

#22120, since its about cyberpandas, i think we should ask tom to double-check that this fixes it.

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019
keechongtan added a commit to keechongtan/pandas that referenced this pull request Dec 29, 2019
…ndexing-1row-df

* upstream/master: (333 commits)
  CI: troubleshoot Web_and_Docs failing (pandas-dev#30534)
  WARN: Ignore NumbaPerformanceWarning in test suite (pandas-dev#30525)
  DEPR: camelCase in offsets, get_offset (pandas-dev#30340)
  PERF: implement scalar ops blockwise (pandas-dev#29853)
  DEPR: Remove Series.compress (pandas-dev#30514)
  ENH: Add numba engine for rolling apply (pandas-dev#30151)
  [ENH] Add to_markdown method (pandas-dev#30350)
  DEPR: Deprecate pandas.np module (pandas-dev#30386)
  ENH: Add ignore_index for df.drop_duplicates (pandas-dev#30405)
  BUG: The setting xrot=0 in DataFrame.hist() doesn't work with by and subplots pandas-dev#30288 (pandas-dev#30491)
  CI: Fix GBQ Tests (pandas-dev#30478)
  Bug groupby quantile listlike q and int columns (pandas-dev#30485)
  ENH: Add ignore_index for df.sort_values and series.sort_values (pandas-dev#30402)
  TYP: Typing hints in pandas/io/formats/{css,csvs}.py (pandas-dev#30398)
  BUG: raise on non-hashable Index name, closes pandas-dev#29069 (pandas-dev#30335)
  Replace "foo!r" to "repr(foo)" syntax pandas-dev#29886 (pandas-dev#30502)
  BUG: preserve EA dtype in transpose (pandas-dev#30091)
  BLD: add check to prevent tempita name error, clsoes pandas-dev#28836 (pandas-dev#30498)
  REF/TST: method-specific files for test_append (pandas-dev#30503)
  marked unused parameters (pandas-dev#30504)
  ...
@TomAugspurger TomAugspurger deleted the ea-transpose branch February 28, 2020 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants