Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copying of columns in select! and transform! #2978

Closed
bkamins opened this issue Dec 31, 2021 · 4 comments · Fixed by #2983
Closed

copying of columns in select! and transform! #2978

bkamins opened this issue Dec 31, 2021 · 4 comments · Fixed by #2983
Labels
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Dec 31, 2021

Is this behavior expected:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1 
   2 │     2 
   3 │     3 

julia> select!(df, :a => :b, :a => :c)
3×2 DataFrame       
 Row │ b      c     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1 
   2 │     2      2 
   3 │     3      3 

julia> df.b === df.c
true

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> transform!(df, :a .=> [:b])
3×2 DataFrame       
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1 
   2 │     2      2 
   3 │     3      3 

julia> df.a === df.b
true

Note that we recently made the following change:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> transform!(df, :a => :b)
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3

julia> df.a === df.b
false
@bkamins bkamins added the bug label Dec 31, 2021
@bkamins bkamins added this to the patch milestone Dec 31, 2021
@nalimilan
Copy link
Member

Woops. The transform! case is clearly a bug (we never want columns to alias each other). The situation is more tricky for select!: it would make sense to avoid a copy for the first renaming operation, but creating aliases when renaming a column multiple times is problematic. Given that the latter is less common than the former, it may be too radical to always make a copy. OTOH rules like "a copy is made only if the column is renamed multiple times" may be too complex for users.

IIRC we discussed this in depth already some time ago, right?

@bkamins
Copy link
Member Author

bkamins commented Jan 1, 2022

IIRC we discussed this in depth already some time ago, right?

Yes we did, and I was advocating not to do to many checks for performance reasons. The PR was #2721 so probably we discussed on Slack.

So we have two potential rules for select! and transform! (I would try to keep the same rules for them):

  1. "renaming makes a copy" -> this is easy to implement
  2. "a copy is made only if the column is renamed multiple times" -> this is doable, but a bit more complex (also note that just writing :a or [:a] uses the column without coping and these operations are not renaming so the rule wording would need to be more complex)

Towards which do you lean more?

@nalimilan
Copy link
Member

This is tricky. I don't think 1. is acceptable as it would make select! much slower than it should (and it's supposed to be an in-place operation so you wouldn't expect copying in simple cases). So I guess I would lean towards 2; do you mean that all uses would trigger a copy, or would you skip the first use?

@bkamins
Copy link
Member Author

bkamins commented Jan 2, 2022

Just to clarify options.

Option 1:

  • :x or [:x] or any column selector does not make copy
  • :x => :y etc. always makes a copy

Option 2:

  • we track all column selectors (:x, [:x] etc.) and all column renaming operations (:x => :y etc.) then no copy is made if the :x column is used for the first time and copy is made for consecutive occurrences.

How this will be handled. We will have a bool vector signaling if a column was already used or not having as many elements as there are columns. Initially it will have false and then no copying will happen. After first use it will be turned to true and copying will be performed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants