Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix aliasing detection in sort! #2981

Merged
merged 7 commits into from
Jan 9, 2022
Merged

fix aliasing detection in sort! #2981

merged 7 commits into from
Jan 9, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jan 6, 2022

The problem in DataFrames.jl 1.3.1 is:

julia> using DataFrames

julia> df = DataFrame(a=1:5, b=11:15, c=21:25)
5×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     11     21
   2 │     2     12     22
   3 │     3     13     23
   4 │     4     14     24
   5 │     5     15     25

julia> sdf = view(df, [1, 2, 3], :)
3×3 SubDataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     11     21
   2 │     2     12     22
   3 │     3     13     23

julia> sort!(sdf)
ERROR: ArgumentError: data frame contains non identical columns that share the same memory

This is caused by incorrect aliasing detection rule.

This PR fixes this.

@bkamins bkamins requested a review from nalimilan January 6, 2022 12:10
@bkamins bkamins added the bug label Jan 6, 2022
@bkamins bkamins added this to the patch milestone Jan 6, 2022
@bkamins
Copy link
Member Author

bkamins commented Jan 6, 2022

Maybe one more note. By doing stripping of SubDataFrame we make sure that the following that worked:

julia> x = [1:6;]
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

julia> df = DataFrame(a=view(x, 1:5), b=view(x, 6:-1:2), copycols=false)
5×2 DataFrame       
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      6 
   2 │     2      5 
   3 │     3      4 
   4 │     4      3 
   5 │     5      2 

julia> dfv = view(df, 1:3, [2])
3×1 SubDataFrame
 Row │ b        
     │ Int64    
─────┼───────   
   1 │     6    
   2 │     5    
   3 │     4    

julia> sort!(dfv)
3×1 SubDataFrame
 Row │ b        
     │ Int64    
─────┼───────   
   1 │     4    
   2 │     5    
   3 │     6    

julia> df # note what happend in column a that was not in dfv
5×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6
   4 │     6      3
   5 │     5      2

is now an error (so now we disallow non-identical column aliasing in parent data frame - not only in the view; I think it is OK and safer).

@nalimilan
Copy link
Member

Good catch. So the problem mentioned in the description is due to this, right? I'm not sure whether that's intended or not in Base (I see a comment "# We cannot do any better than the usual dataids check" there).

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> Base.mightalias(view([1, 2, 3], x), view([1, 2, 3], x))
true

The problem in your last comment seems to be more general, and not even specific to DataFrames: aliased vectors can give weird results in many occasions. Is sort! the only place where this kind of problem can happen? Maybe also with permute! and setindex!?

@bkamins
Copy link
Member Author

bkamins commented Jan 7, 2022

I'm not sure whether that's intended or not in Base

In Base for views it checks not only that the data does not alias but also that the vector specifying subset of rows to be picked does not alias. So you have that two completely different vectors that share indexing vectors are considered to be aliased.
This is a "safe" strategy, although in general I would assume that indexing vector for a view should never be mutated.

Is sort! the only place where this kind of problem can happen?

sort! is the only function in DataFrames.jl that reorders rows in-place on AbstractDataFrame. The other with similar effect is deleteat!, but it requires DataFrame.

We do not define permute! on AbstractDataFrame.

For setindex! it can happen, but at least rows are not reordered + we need setindex! to be fast.


On the other hand we could drop this aliasing check altogether as the problem can happen in super rare cases (someone puts two views of the same vector with different offset). This would save us doing the check that has quadratic cost in number of columns.

@nalimilan
Copy link
Member

In Base for views it checks not only that the data does not alias but also that the vector specifying subset of rows to be picked does not alias. So you have that two completely different vectors that share indexing vectors are considered to be aliased. This is a "safe" strategy, although in general I would assume that indexing vector for a view should never be mutated.

Yes, ideally we would have a way to check only whether the contents alias, ignoring indices. But that's probably not easy.

On the other hand we could drop this aliasing check altogether as the problem can happen in super rare cases (someone puts two views of the same vector with different offset). This would save us doing the check that has quadratic cost in number of columns.

We still have to check whether some columns are equal, but that's probably less costly to do. So maybe we could drop the mightalias check indeed. If you have created a data frame with columns that alias each other without being equal, you have more serious problems than sorting -- and it's too costly to check on creation. :-)

@bkamins
Copy link
Member Author

bkamins commented Jan 7, 2022

Yes, ideally we would have a way to check only whether the contents alias, ignoring indices. But that's probably not easy.

This is what we essentially do now (as we are comparing parents)

So maybe we could drop the mightalias check indeed.

This is what I will do and will make a note in release notes.

@bkamins
Copy link
Member Author

bkamins commented Jan 7, 2022

@nalimilan - I have fixed the approach. Now we only consider columns passing === test as aliases.
A side benefit is that sorting very wide data frames is now significantly faster (for 10,000 columns the overhead for anti-aliasing checking was over 1 second).

src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
test/sort.jl Outdated Show resolved Hide resolved
test/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
bkamins and others added 3 commits January 7, 2022 22:08
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2022

@nalimilan - can you please have a look again at this PR before I merge it? I have marked several tests as broken (previously they threw an error because of more aggressive aliasing detection; now they pass producing a wrong result)

@bkamins bkamins merged commit 535e731 into main Jan 9, 2022
@bkamins bkamins deleted the bk/fix_sorting_aliasing branch January 9, 2022 13:17
@bkamins
Copy link
Member Author

bkamins commented Jan 9, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants