implement faster innerjoin #2612
Conversation
Thanks for working on this. Any reason not to reuse our grouping code? Optimizations matter quite a lot for …
It was not clear to me how to take advantage of these optimizations. The reason is that even if, e.g., …
I have added a fast join path for sorted tables. Unfortunately it cannot be used with …
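For context on what a sorted fast path buys: when both key columns are sorted, matches can be found in a single forward pass instead of via a hash table. Below is a minimal, self-contained sketch of that merge-join idea; the function name `sorted_innerjoin_indices` and all details are illustrative, not the code in this PR.

```julia
# Hedged sketch of a merge-based inner join on two *sorted* key vectors.
# Returns the pairs of indices (i, j) with isequal(left[i], right[j]).
function sorted_innerjoin_indices(left::AbstractVector, right::AbstractVector)
    left_ixs, right_ixs = Int[], Int[]
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if isless(left[i], right[j])
            i += 1
        elseif isless(right[j], left[i])
            j += 1
        else
            # equal keys: find the end of the duplicate run on each side
            i2 = i
            while i2 < length(left) && isequal(left[i2 + 1], left[i])
                i2 += 1
            end
            j2 = j
            while j2 < length(right) && isequal(right[j2 + 1], right[j])
                j2 += 1
            end
            # emit the cross product of the two runs of equal keys
            for ii in i:i2, jj in j:j2
                push!(left_ixs, ii)
                push!(right_ixs, jj)
            end
            i, j = i2 + 1, j2 + 1
        end
    end
    return left_ixs, right_ixs
end

sorted_innerjoin_indices(["a", "b", "b", "d"], ["b", "c", "d"])  # ([2, 3, 4], [1, 1, 3])
```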
OK. So here are the timings. The conclusion is:
@nalimilan - what do you think we should do?

smaller data

current main:

PR:

bigger data

current main:

PR:
Yeah, the comparison between pools is still an annoying problem and I haven't tried implementing the global table to fix that. At least it shouldn't be hard to check whether …
What's your question exactly? If I understand correctly, this PR is always as fast as main or faster, so I have nothing to object to. :-) Something which would be worth benchmarking is joining on multiple columns. I think it's the case where hashing columns one by one like … Here's another run of your benchmarks (on a Xeon 4114 at 2.20GHz):

smaller data

main:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
4.712898 seconds (2.43 M allocations: 843.429 MiB, 23.98% gc time, 40.77% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
1.980586 seconds (194 allocations: 252.524 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
4.824913 seconds (194 allocations: 707.847 MiB, 45.99% gc time)
julia> @time innerjoin(df2, df1, on=:id);
3.083552 seconds (194 allocations: 252.524 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));
julia> @time innerjoin(df1, df2, on=:id);
1.624032 seconds (196 allocations: 667.021 MiB, 15.88% gc time)
julia> @time innerjoin(df2, df1, on=:id);
1.563919 seconds (196 allocations: 262.201 MiB, 9.95% gc time)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));
julia> @time innerjoin(df1, df2, on=:id);
5.085381 seconds (196 allocations: 666.978 MiB, 37.59% gc time)
julia> @time innerjoin(df2, df1, on=:id);
2.794582 seconds (196 allocations: 262.180 MiB)

PR:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
1.964033 seconds (1.70 M allocations: 122.239 MiB, 19.72% gc time, 65.16% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
0.357172 seconds (154 allocations: 22.900 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
3.679063 seconds (257.89 k allocations: 105.835 MiB, 5.86% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
3.884744 seconds (246 allocations: 90.813 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));
julia> @time innerjoin(df1, df2, on=:id);
0.404584 seconds (154 allocations: 22.133 MiB)
julia> @time innerjoin(df2, df1, on=:id);
0.409299 seconds (154 allocations: 22.133 MiB)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));
julia> @time innerjoin(df1, df2, on=:id);
5.682937 seconds (1.27 M allocations: 146.871 MiB)
julia> @time innerjoin(df2, df1, on=:id);
6.104373 seconds (1.27 M allocations: 146.871 MiB)

bigger data

main:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
46.291461 seconds (194 allocations: 6.260 GiB, 42.79% gc time)
julia> @time innerjoin(df2, df1, on=:id);
28.256333 seconds (194 allocations: 1.588 GiB, 21.50% gc time)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
137.688439 seconds (194 allocations: 6.260 GiB, 76.09% gc time)
julia> @time innerjoin(df2, df1, on=:id);
61.029408 seconds (194 allocations: 1.588 GiB, 35.07% gc time)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));
julia> @time innerjoin(df1, df2, on=:id);
27.746145 seconds (194 allocations: 5.712 GiB, 54.92% gc time)
julia> @time innerjoin(df2, df1, on=:id);
16.150328 seconds (194 allocations: 1.582 GiB, 20.20% gc time)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));
julia> @time innerjoin(df1, df2, on=:id);
116.331639 seconds (196 allocations: 5.727 GiB, 65.69% gc time)
julia> @time innerjoin(df2, df1, on=:id);
52.525720 seconds (196 allocations: 1.597 GiB, 30.45% gc time)

PR:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
4.144959 seconds (1.70 M allocations: 122.124 MiB, 32.25% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
2.769002 seconds (154 allocations: 22.900 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
41.557779 seconds (257.89 k allocations: 105.835 MiB, 0.58% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
40.998080 seconds (246 allocations: 90.813 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));
julia> @time innerjoin(df1, df2, on=:id);
3.731307 seconds (154 allocations: 22.113 MiB)
julia> @time innerjoin(df2, df1, on=:id);
3.816319 seconds (154 allocations: 22.113 MiB)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));
julia> @time innerjoin(df1, df2, on=:id);
39.876243 seconds (1.27 M allocations: 146.741 MiB)
julia> @time innerjoin(df2, df1, on=:id);
39.830833 seconds (1.27 M allocations: 146.741 MiB)
Here are benchmarks on integer columns. In this comparison the PR looks much better (so …).

smaller data

current main:

this PR:

larger data

current main:

this PR:
I do not think this would be a common case as most likely you are joining columns coming from different sources.
Not exactly - in some cases it is a bit slower as it allocates a lot more than the old one.
I will run such benchmarks and report the results.
Here are tests for two columns (smaller data, as this is more problematic). In general it is not bad. What I do in my PR is that I allocate a vector of tuples from a tuple of vectors, and then things are fast. The creation of this vector uses memory (which is bad), but it is relatively fast. A sketch of this idea is shown after the timings below.

current:

this PR:
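To make the "vector of tuples from a tuple of vectors" idea above concrete, here is a hedged sketch (the helper name `tuple_rows` is made up; this is not the PR's implementation):

```julia
# Turn a tuple of key columns into a single vector of row tuples so that a
# multi-column key can be hashed and compared as one value per row.
tuple_rows(cols::Tuple{Vararg{AbstractVector}}) = collect(zip(cols...))

left_keys  = (["a", "b", "c"], [1, 2, 3])
right_keys = (["b", "c", "d"], [2, 3, 4])

# This allocates a new vector (the memory cost mentioned above), but the
# resulting Vector{Tuple{String, Int64}} is fast to hash row by row.
k_left  = tuple_rows(left_keys)   # [("a", 1), ("b", 2), ("c", 3)]
k_right = tuple_rows(right_keys)  # [("b", 2), ("c", 3), ("d", 4)]
```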
@nalimilan - thinking about this, the crucial tension is:

So there is a U-shaped relationship between the PR and …

What I will do is investigate if using …
Well yeah that's a special case where there's a one-to-one correspondence between tables. Maybe checking that one pool is a superset of the other is more useful.
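As an illustration of the superset check mentioned above, a hedged sketch using the generic `DataAPI.refpool` accessor (assuming PooledArrays; `pools_compatible` is a made-up name):

```julia
using DataAPI, PooledArrays

# Check whether every level of a's pool also appears in b's pool,
# i.e. whether b's pool is a superset of a's and a's levels could be remapped.
pools_compatible(a, b) = issubset(DataAPI.refpool(a), DataAPI.refpool(b))

x = PooledArray(["a", "b", "a"])
y = PooledArray(["a", "b", "c", "b"])

pools_compatible(x, y)  # true: levels of x are a subset of levels of y
pools_compatible(y, x)  # false: "c" is not a level of x
```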
Would it help to allocate a large vector and reuse it? Just a thought.
If I understand you correctly, this is exactly what …
Well, the different …
I have an idea how to improve things without allocating much. I will try to push to the PR and compare.
I have pushed another commit - for comparison - using a strategy similar to …
This commit looks uniformly much better than whatever we had. Therefore I think it is good to have a look at it now from the implementation perspective, and I will write more correctness tests. The strategy is:

small string:

large string:

small int:

large int:

small mixed:
When we finalize …
I have added tests and NEWS.md (@andyferris - in the end I use a much more complex algorithm than the one in SplitApplyCombine.jl). This should be ready for review.
@nalimilan - the only thing left to add is probably a special path for PooledArrays.jl and CategoricalArrays.jl (but it should not significantly change the logic I have now). I will try to work on it this weekend.
@nalimilan Here are timings for …

current:

this PR:
Having generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612. Consider a short table and a long table joined on some column. In order to be fast we need to map values from the short table's key to ref values of the long table's key. This allows two things for `innerjoin`:
1. we can immediately drop values from the short table that are not present in the long table;
2. later we can do the join on integer columns, which is way faster than joining on e.g. a string column.
Also, since we do the mapping on the short table, this operation should be fast. In particular, if the short table's key defines `refarray`, it is particularly fast, as we only need to map the reference values. For CategoricalArrays.jl and PooledArrays.jl, `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel. I am not sure what would have to be defined in Arrow.jl.
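A rough illustration of the mapping described above, as a hedged sketch (assuming PooledArrays implements `DataAPI.invrefpool` as a value-to-ref lookup supporting `get`, as stated; the variable names are made up):

```julia
using DataAPI, PooledArrays

# Long table key column: pooled, so it has an inverted pool mapping value -> ref code.
long_key = PooledArray(["apple", "banana", "cherry", "banana"])
invpool  = DataAPI.invrefpool(long_key)

# Short table key column: map each value to the long table's ref code,
# using `nothing` as the sentinel for values absent from the long table.
short_key = ["banana", "kiwi", "cherry"]
mapped = [get(invpool, v, nothing) for v in short_key]
# e.g. [0x02, nothing, 0x03] (exact codes and integer type depend on the pool):
# "kiwi" can be dropped immediately, and the rest of the join can run on integers.
```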
I have a … I should have also sent it to you both. Any ideas on how best to share it?
@nalimilan - I have added …
Can you please send it to my email? It is my GitHub user @ gmail.
I have checked the code on:
and unfortunately (as it would be better if we could reproduce it) it does not error.
I'm sorry, it was my fault 😊 The bug @pdeffebach discovered was a very interesting one involving the display size, the alignment of floating-point numbers, and the cropping algorithm. The fix, fortunately, is very simple. I will release a new version.
@pdeffebach + @ronisbr - thank you for making the JuliaData ecosystem better and better every day!
Thanks! And sorry for this bug. It was really an important case that I totally missed when creating the alignment algorithm. @pdeffebach PrettyTables.jl v0.11.1 was just released. Can you please try again to see if this bug is fixed?
@nalimilan - to keep track of the GC issue I am uploading the files here (as Discourse disallows it).
Example code:
Timing results:
I really thought this was a PooledArrays or SentinelArrays issue. However, @ronisbr's updated version of PrettyTables made this work!
@nalimilan - I have benchmarked the code thoroughly and it is consistently not worse. I have pushed the design of the benchmarking code that most of the time does not lead to horrendous GC (but sometimes it still does, especially when working with CategoricalArrays.jl). Also, I have experimented with not using internal functions in commit a3de1c2. It reduces the performance by ~5%. I have reverted this commit, but maybe we can sacrifice this 5% for the sake of cleaner code. Apart from this decision, this PR is done.
5% sounds acceptable if that makes the code less likely to break in the future.
This reverts commit 07ecb0a.
@nalimilan - I have reviewed everything again and pushed some small cleanups. Could you please have a look at the PR one last time before I merge? Thank you!
Thank you! The next two PRs are:
First step towards resolving #2340.
Doing `innerjoin`, as it is the most common join and different from the others (it does not introduce `missing`s). It passes tests, but I have to do some benchmarking of it. I will post the results. When I am sure it is OK I will mark it as ready and update NEWS.md.
Of course some external tests are welcome.
CC @nalimilan @andyferris (I use the ideas from SplitApplyCombine.jl and our discussions, but decided not to introduce a dependency - I hope it is OK)
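For readers less familiar with the distinction mentioned above, a small standard DataFrames.jl example (nothing here is new API):

```julia
using DataFrames

left  = DataFrame(id = [1, 2, 3], x = ["a", "b", "c"])
right = DataFrame(id = [2, 3, 4], y = [2.0, 3.0, 4.0])

# innerjoin keeps only rows whose `id` occurs in both tables,
# so the result contains no `missing` values.
innerjoin(left, right, on = :id)   # 2×3 DataFrame (ids 2 and 3)

# By contrast, outerjoin keeps unmatched rows from both sides and fills
# the corresponding columns with `missing`.
outerjoin(left, right, on = :id)   # 4×3 DataFrame with `missing` entries
```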