Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: DataFrame.take always returns a copy #30784

Merged
merged 12 commits into from
Jan 27, 2020

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jan 7, 2020

Closes #27357

This adds an internal version of take with the behaviour of setting _is_copy for DataFrames that the public take did before, so this version can be used in the indexing code (where we want to keep track of parent dataframe with _is_copy for SettingWithCopyWarnings).
I named it _take_with_is_copy which is literally what it is doing, but happy to hear alternatives.

This then updates the deprecation to fully ignore the keyword and indicate in the deprecation message the keyword has no effect anymore.

I checked #27349 and #30615 to ensure that where previously the internal version was used or is_copy was specified, now the appropriate function is used. I suppose that in some of the cases where I now use the internal _take_with_is_copy this is not actually needed, but it's the safest thing anyway (it will do the same as it did before).

@@ -3370,14 +3368,7 @@ class max_speed
new_data = self._data.take(
indices, axis=self._get_block_manager_axis(axis), verify=True
)
result = self._constructor(new_data).__finalize__(self)

# Maybe set copy if we didn't actually change the index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if its relevant, but DTI.take sometimes returns a view

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case the indices can represent a slice, right?
Personally I don't think we should make that optimization, as IMO it is much clearer for take to always be a copy, but I suppose for Index which is supposed to be immutable it might matter less.

@jorisvandenbossche
Copy link
Member Author

So the reason that this is (expectedly) failing, is due to missing SettingWithCopyWarnings. I suppose we are internally relying on the current take behaviour to set _is_copy in the indexing code, where it is less clear if the result is a copy of not (the reason to have those SettingWithCopyWarnings).
So by making take always be a clean copy (no weakref saved in _is_copy), this no longer works as expected for the indexing code.

So if we want to go through with this change, we would need to track all internal usage of take in such indexing situations and update that (or have an internal take version with this old behaviour).

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review January 23, 2020 14:33
@jorisvandenbossche jorisvandenbossche added API Design Deprecate Functionality to remove in pandas labels Jan 23, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.0 milestone Jan 23, 2020
return result
See the docstring of `take` for full explanation of the parameters.
"""
return self.take(indices=indices, axis=axis, **kwargs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am including a Series._take_with_is_copy to override the generic one. This is because the defaults for the old is_copy kwargs were different for Series (False) and DataFrame (True).

I am not fully sure if the code paths that use _take_with_is_copy actually can have a Series calling it, but included this to be sure.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your _take_with_is_copy makes sense, thanks.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test that asserts that no SettingWithCopy warning is raised by .asof and .sample (e.g. the test cases that had to change and your example)

@jorisvandenbossche
Copy link
Member Author

Added tests for sample and asof

@jreback
Copy link
Contributor

jreback commented Jan 26, 2020

looks fine. needs a rebase. can you run some approporiate asv's to make sure doesn't hit perf much.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jan 26, 2020

The specific asv benchmark that had a regression because of this, indicates it is fixed:

N = 10000
M = 10
rng = pd.date_range(start="1/1/1990", periods=N, freq="53s")
data = pd.DataFrame(np.random.randn(N, M), index=rng)
dates = date_range(start="1/1/1990", periods=N * 10, freq="5s")

0.25.3:

In [2]: %timeit data.asof(dates) 
3.9 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

master:

In [2]: %timeit data.asof(dates)
14 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

PR:

In [2]: %timeit data.asof(dates) 
3.93 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jreback jreback merged commit a9b61a9 into pandas-dev:master Jan 27, 2020
@jreback
Copy link
Contributor

jreback commented Jan 27, 2020

great thanks @jorisvandenbossche

@lumberbot-app
Copy link

lumberbot-app bot commented Jan 27, 2020

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.0.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 a9b61a9c33f97554cf3dde6272644fc985db8dfb
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #30784: API: DataFrame.take always returns a copy'
  1. Push to a named branch :
git push YOURFORK 1.0.x:auto-backport-of-pr-30784-on-1.0.x
  1. Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #30784 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2020

@jorisvandenbossche if you can backport this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPR: .take(..., is_copy=True); rename is_copy -> copy
5 participants