-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for future copy / view semantics in indexing operations #36195
Comments
asv from a branch that changes column-based indexing to always return a view:
Methodological note: i did a full asv run, then for everything that came back as changed re-ran those and changes that were not reproducible (e.g. tslibs benchmarks which have to be noise) |
I'd like to move forward with this, including a version of the proposal on https://pandas.pydata.org/pandas-docs/dev/development/roadmap.html. There hasn't been much activity on the google doc, https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. The biggest outstanding item is a decision / discussion of "Copy on Write" vs. "Error on Write". It'd be nice if the proposal was explicit on this point, but I worry we won't be able to decide without actually trying it out. |
Some assorted thoughts I gathered the last weeks (not necessarily all needing an answer in an initial discussion / proposal, though):
|
I think the crucial question that deserves more discussion is: are we
fine with ditching the SettingWithCopyWarning for those cases it was
introduced for?
Because there is a reason that it was introduced: to warn uses that
were doing a setitem operation that might have no effect silently, and thus
be confusing / easy to miss for users.
I think the classical example is something like df[df['B'] > 4]['B'] =
10 - a chained indexing assignment that doesn't work. The warning was
introduced to prevent people doing that and silently having no effect.
Are we fine with it that this will indeed have silently no effect?
(which would be the consequence of copy-on-write)
For me, this is one of the compelling arguments to go with "Error on
Write". Then we are forcing people to be explicit about their intent,
instead of depending on a behavior that is inconsistent (today) and for
which we might not figure out all of the boundary cases if we chose
"copy-on-write". Also, I've seen the following behavior with people I've
worked with:
1) Ignoring the SettingWithCopyWarning, passing that code on to someone
else who then has to deal with it
2) Writing a function that works perfectly fine in test scenarios with data
read from a CSV file, but then handing off that function to someone else to
use, who passes a different DataFrame that was constructed via some
filtering operation, which then causes the warning to be issued.
3) Writing some code that is then passed to a totally different team, which
treats all warnings as errors, in which case the code has to be fixed
If we silently do copy-on-write, I'm concerned that these types of issues
wouldn't get found by people using pandas.
- Dr. Irv
|
Agreed that the classic chained indexing example is a compelling argument for Error on Write rather than Copy on Write.
IMO, yes that should create two views to the same data.
I'd like to keep this focused just on setting data / values. I'm hopeful that the metadata / flags issue can be solved by doing shallow copies by default and providing an option to copy the data if desired, like we did for
If possible, treating it the same as a DataFrame seems best to me. |
is there any system that does error-on-write? I do prefer this, but am a little concerned that we are treading new group vs. copy-on-write (which I think has a large body of systems / literature). |
Putting it here because most of the discussion on this topic is still here (but can also move to #36988). While the classic chained indexing example is indeed an argument for Error-on-Write (for this case, it's basically what we have now but with an error instead of warning), I think there are also arguments/examples to give in favor of Copy-on-Write:
In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function. |
Assume that we could actually detect chained indexing? In this hypothetical situation, would people then be more in favor of Copy-on-Write over Error-on-Write? The assumption would be that we can detect the difference in
vs
I am wondering if we shouldn't be able to actually detect this based on reference counting. In the first case, there is no additional reference to |
It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through. The other issue for me is that copy-on-write silently does something, which might have side effects that a user would like to be made aware of, whereas error-on-write makes the user very aware of what he is doing. |
Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation. |
@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose it is tracking refs but i don't actually think this is that robust - and thats the same reason i am not enthusiastic about copy on write yes if we can guarantee that we can always do it great but that's the problem i am not sure we really can though maybe now that we have full control over PandasArrays it might be fully possible (way back when when i did SettingWithCopy) some ops just indexes directly to the numpy arrays and we had to be very defensive about this today we have much more robust handling |
Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write". My sense is that it is easier to go from the current state of |
Just to clarify the choice, is it fair to say that the only real choice is between
The initial proposal (and my PR) left out the warning specifically for chained indexing part. If we can reliably detect chained indexing (a big if?) then do people prefer option 2? |
@jreback It might be achieved using the same mechanism (I would need to look more into the code of this, it's not a part I am very familiar with), but at least the goal I described is different: the current SettingWithCopyWarning warns in both example cases, while I want only the first (actual chained indexing) to raise, and not the second (assigning the subset to a variable).
@Dr-Irv I don't fully understand your last sentence. |
Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views) |
I prefer this over the status quo. |
No, I mean let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write. |
OK, understand it now ;) (and to be clear: while I am mainly arguing here for Copy-on-Write to give some counter-arguments, I am not yet sure myself which of the two is "best". I mainly think it's something to futher explore/experiment/discuss)
I also prefer that over status-quo. But, @jbrockmendel, do you also prefer that over the copy-on-write/error-on-write alternatives discussed? (personally, I think that I am currently in favor of one of those two other options) |
Here's something that I brought up in the October, 2020, dev meeting. Consider: >>> df = pd.DataFrame({"x":[1,2,3], "y":[4,5,6]})
>>> df
x y
0 1 4
1 2 5
2 3 6
>>> sx=df['x']
>>> sx.iloc[1] = 10
>>> sx
0 1
1 10
2 3
Name: x, dtype: int64
>>> df
x y
0 1 4
1 10 5
2 3 6
>>> sdf = df[['x']]
>>> type(sdf)
<class 'pandas.core.frame.DataFrame'>
>>> sdf
x
0 1
1 10
2 3
>>> sdf.loc[1,'x'] = 20
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value)
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> sdf
x
0 1
1 20
2 3
>>> df
x y
0 1 4
1 10 5
2 3 6 Right now, the behavior of picking a single column using the syntax On the other hand, if we use error-on-write, then in BOTH cases, we raise an error, which makes it clear to the user that they need to make their intent clear. |
Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally? |
By "doing so intentionally", you mean modifying the underlying |
Yes. A case where I want to modify both. |
Yes, the proposal has a |
Sorry, I'm missing the difference between these "clear rules" and the "whenever possible", could you clarify? I'm also missing one thing in the proposal: how can something like A syntax I would understand would be something like |
Indexing columns would always give views. The difference comes when you try to mutate the view >>> df2 = df1[["A']] # df2 is a view on df1
>>> df2.iloc[:, :] = 0 # Copies (with Copy on Write) or raises (with Error on Write). And for the case of "I want to mutate
I think this discussion is limited to
If that sounds right, I'll clarify it in the proposal document. |
Hi all. I'm a long-time pandas user, and I've been teaching pandas to beginner/intermediate data science students since 2015. Much of that teaching is online, and as a result I've answered hundreds (or perhaps thousands) of pandas questions from newer users. That is all to say that I have somewhat of a pulse on what newer pandas users ask about. I read through the copy-on-write proposal, and I wanted to share a few reactions to the proposal:
I hope some of this is helpful to the discussion! |
Hello maintainers. Thank you for putting out a thoughtful proposal and getting more community feedback in the process. I echo all of justmarkham's comments above, and left my comments in the google doc. the tl;dr:
Here's an example of what things look like from the R world of copy-on-modify: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062 |
Sorry for the double post, but I updated the above gist for what "chained indexing" looks like from the R world (if that helps any) The direct file is here: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062#file-chained_indexing-base_tidyverse-r |
@justmarkham @chendaniely Thanks a lot for the feedback, that's really valuable. I am personally happy to hear that you both think the current proposal would be more intuitive for (novice) users than the current situation (since this was one of my goals with the proposal). I also think that we should not assume our users to know (numpy) concepts of copy vs view in indexing.
So chained assignment actually works in R! There is one aspect related to this brought up by @shoyer on the mailing list: we could make one exception on the copy-on-write rule, i.e. accessing a single column of a DataFrame as a series would still be a view, any other indexing operation would follow copy-on-write as proposed. I recently commented about this on the mailing list at https://mail.python.org/pipermail/pandas-dev/2021-October/001395.html This would ensure that one type of chained assignment (and possibly the most common one) keeps working:
Yes, that's a good point and something I should explore in more detail (and add a section about in the document). I think in general for the majority of workflows, the performance will either stay the same or improve (because of avoiding unnecessary copies). There will be specific cases that might be impacted though:
|
A little cross promotion: https://youtu.be/aBeEN2klZQE |
This can't come soon enough. (Background: I teach 50-100+ folks per month on Pandas and wrote Learning the Pandas Library, Pandas Cookbook 2nd Ed, and Effective Pandas. I also consult on Pandas.) Happy to give a prototype a run through on some of my Pandas material. However, I never hit our friend As far as some code breaking, IMO you should release the next minor version before this gets pushed updating the deprecation warning to state that certain behavior will be removed, and then push a minor (major?) version with the fix. WRT "the breaking away from NumPy behavior" argument preventing this. No one I teach (or have consulted with) cares about that. Most of them don't use NumPy (and don't need to). Seems like a weird implementation detail to prevent such a useful update. Again, this can't come soon enough for those who haven't embraced chaining. |
@jorisvandenbossche Out of curiosity, while I recognize that moving to copy on view is technically a breaking change, are you aware of any real-world examples where people are relying on pandas returning views in their code? The rules around when pandas returns views are so idiosyncratic and sensitive to apparently-unrelated code changes that I've never felt remotely safe using views, and am curious if anyone else ever does... |
I just wanted to add a huge - yes please - how can I help - when can we have this! And @nickeubank - I agree - I never feel safe deliberately using views. I guess the problem is that a lot of code doesn't deliberately use them, but uses them by accident. I did offer to do a survey to find out if that happened in real life, in #36195 (comment) - but I would in general love to know what it would take to get this in. |
I think that any case of chained assignment that currently works (intentionally, or somewhat by accident) is an example using a view that for sure people do rely on. Apart from that, I fully agree that it's probably not common that people create an actual subset dataframe that is a view, and mutate that with the intention to also mutate the original dataframe (which is for me a good reason to not do this, because this is not what people expect, and in a large majority of cases also not what people want). And to be clear, the chained assignment comprises both the typical case of doing something like |
Yeah, ok, I can see that! The compactness of the pattern means all the weird ways a view can "break" don't come into play, and so I can see how it would be frequently used naively. Thanks! |
About the "what would it take to get this in / how can I help": First, it would actually be good to get some more explicit support from other pandas core devs (it's hard to estimate to what extent there is consensus around the latest proposal). But apart from that:
So on the short term helping with with reviewing or some extra implementation tasks is most needed, but certainly also not super approachable to help with. |
Update: concrete proposal for Copy-on-Write: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing
[original post]
Hi all,
During the recent sprint we discussed copy / view semantics in indexing operations. I've tried to summarize a proposal that came out of it.
The tldr is
The full proposal is available at https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. I'd like to collect comments there and then propose properly by adding it to the roadmap.
cc @pandas-dev/pandas-core
The text was updated successfully, but these errors were encountered: