Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/TST: add tests for new copy/view behaviour #46979

Merged
merged 18 commits into from
May 31, 2022

Conversation

jorisvandenbossche
Copy link
Member

This is broken off from #46958 / #41878

This are the new tests that I wrote for the PRs implementing the proposed new copy/view semantics with Copy-on-Write (#36195). They are intended to be readable tests to follow and review the intended semantics.
They will partly be duplicating some other tests that also deal with copy/view details spread around in the test suite (see the required test edits in the abovementioned PRs), but I would say this is fine, as I think it is good to have a set of centralized tests focused on those copy/view semantics.

I broke of this part from the main PR because I think this part can be done as a pre-cursor (reducing the diff size for the main PR), and because it might be useful to review those tests as a way to review the proposed behaviours (cc @pandas-dev/pandas-core).

Some practical questions about the test organization:

  • Does the general style of the tests look OK?
  • I currently put those tests in /tests/indexing/test_copy_on_write.py, but 1) it's not only about indexing, and 2) "copy_on_write" is more the internal implementation mechanism, while the user facing thing we are testing is copy/view behaviour. So maybe I could put them in a new top-level tests/copy_view/ test directory? (which also allows more easily to split it in multiple files)

@rhshadrach
Copy link
Member

Tests look good to me. As a very small nit on style, I think most tests are using from pandas import DataFrame instead of pd.DataFrame, etc.

+1 on tests/copy_view

@jorisvandenbossche jorisvandenbossche added this to the 1.5 milestone May 20, 2022
@jorisvandenbossche
Copy link
Member Author

The two failing builds seem unrelated conda installation failures.

@jreback
Copy link
Contributor

jreback commented May 21, 2022

cc @mroeschke if you can have a look

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 2 questions:

  1. Might have missed it, but is it worth testing subsetting by row and column (though it should already copy)
  2. I see you're mainly testing with integer dtype, just confirming that this behavior should be dtype independent?

@jorisvandenbossche
Copy link
Member Author

Might have missed it, but is it worth testing subsetting by row and column (though it should already copy)

That was indeed not yet explicitly tested. I added an additional test for subsetting with loc/iloc with both a column and row indexer (parametrized for a few different kind of indexers), and then generally testing that mutating the subset will not change the parent (indeed depending on the exact indexer, the result will actually already be a copy and will never update the parent anyway, but it is still an easy generic test to add).

I see you're mainly testing with integer dtype, just confirming that this behavior should be dtype independent?

Yeah, good question. In general it depends on the test I think. For tests that are ensuring that getitem operations are properly tracking references, I think it shouldn't matter in theory. What does matter is single vs multiple blocks, as that can take a different code path, and so I already parametrized a few of the tests to use homogeneous dtype vs mixed dtypes (I should probably expand this to more of the tests).
The tests checking several setitem paths probably do depend on the dtype (or for example at least on numpy dtype vs extension dtype, since the Block implements some aspects of setitem). It would be good to expand the coverage of those tests, but on the other hand I also want to keep those tests as readable as possible (as kind of additional source to understand the intended behaviour). I was thinking that we could add some additional tests that are heavily parametrized (and less readable) and increase coverage, while keeping the current explicit tests as is? (and if that sounds good, that's also something that could be done in a follow-up PR)

@jorisvandenbossche
Copy link
Member Author

I also added a helper function to do the equivalent of df["col"].values, but without going through the public __getitem__ API. Since that triggers a reference to be tracked to the parent dataframe from the resulting series, that might interfere with the operation that we are actually testing (eg when we are testing that taking a subset in some other way properly tracks references to that column)

@mroeschke
Copy link
Member

Thanks, yeah a follow up PR testing over more dtypes would be great.

One typing issue otherwise LGTM

pyright..................................................................Passed
mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

pandas/tests/copy_view/test_methods.py:15: error: Unused "type: ignore" comment
pandas/tests/copy_view/test_methods.py:29: error: Unused "type: ignore" comment
Found 2 errors in 1 file (checked 1384 source files)

@jorisvandenbossche
Copy link
Member Author

Any more comments here?

@mroeschke mroeschke merged commit b913935 into pandas-dev:main May 31, 2022
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the cow-tests branch May 31, 2022 22:07
@jorisvandenbossche
Copy link
Member Author

And thanks for the review!

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* API/TST: add tests for new copy/view behaviour

* fixes for non-CoW cases

* move test file to tests/copy_view

* split into multiple files

* Fix tests for ArrayManager case

* fix type checking

* may_share_memory -> shares_memory

* update pandas imports

* add test for iloc/low with row+column indexer

* user helper to get column values to assert np.shares_memory

* fix typing

* handle new iloc inplace deprecation warnings

* handle new iloc deprecation warnings for ArrayManager
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants