Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: avoid going through .values for reindexing both index/columns for ArrayManager #44798

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

This fastpath calls self.values and thus is only a fastpath for BlockManager (we only get here with all columns having the same dtype).

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance ArrayManager labels Dec 7, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.4 milestone Dec 7, 2021
@jbrockmendel
Copy link
Member

I've been looking at this code path as, for reasons related to the CI failures, it appears fragile. e.g. in TestDataFrameSelectReindex.test_reindex_datetimelike_to_object setting df["b"] = 9 (following df = ser.unstack()) breaks similarly. Another degree of separation xref #42921.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Dec 7, 2021

It also seems that our deprecation is not working if you only reindex the rows (and not the rows/columns together):

In [1]: arr = date_range("2016-01-01", periods=6).values.reshape(3, 2)
   ...: df = DataFrame(arr, columns=["A", "B"], index=range(3))
   ...: ts = df.iloc[0, 0]
   ...: fv = ts.date()

In [2]: df.reindex(index=range(4), columns=["A", "B", "C"], fill_value=fv)
<ipython-input-2-fdba941b8324>:1: FutureWarning: Using a `date` object for fill_value with `datetime64[ns]` dtype is deprecated. In a future version, this will be cast to object dtype. Pass `fill_value=Timestamp(date_obj)` instead.
  df.reindex(index=range(4), columns=["A", "B", "C"], fill_value=fv)
Out[2]: 
           A          B          C
0 2016-01-01 2016-01-02 2016-01-01
1 2016-01-03 2016-01-04 2016-01-01
2 2016-01-05 2016-01-06 2016-01-01
3 2016-01-01 2016-01-01 2016-01-01

In [3]: df.reindex(index=range(4), fill_value=fv)
...
TypeError: value should be a 'Timestamp' or 'NaT'. Got 'date' instead.

The second should also work and trigger the deprecation warning (the tests here are running into this because I am now basically taking the reindex path in separate steps for AM).

Which is probably indeed the same as #42921

@jbrockmendel
Copy link
Member

It also seems that our deprecation is not working

IIRC this code path goes through maybe_promote, which doesn't quite match the DTA behavior. There was some work a few months ago that brought them closer into alignment, but off the top of my head I don't remember how far that got.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

seems reasonable can you merge master

@jreback jreback removed this from the 1.4 milestone Dec 27, 2021
if (
row_indexer is not None
and col_indexer is not None
and not isinstance(self._mgr, ArrayManager)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think a more general condition would be self._can_fast_transpose. The thing we care about is self.values being cheap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche can you address comments and rebase

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants