implement faster innerjoin #2612

Merged on Feb 13, 2021 (63 commits)

bkamins
Member

bkamins commented Jan 26, 2021

First step towards resolving #2340.
Starting with innerjoin, as it is the most common join and differs from the others (it does not introduce missings).

It passes tests, but I have to do some benchmarking of it. I will post the results. When I am sure it is OK I will mark it as ready and update NEWS.md.

Of course some external tests are welcome.

CC @nalimilan @andyferris (I use the ideas from SplitApplyCombine.jl and our discussions, but decided not to introduce a dependency - I hope it is OK)

@bkamins added the non-breaking and performance labels Jan 26, 2021
@bkamins added this to the 1.0 milestone Jan 26, 2021
@nalimilan
Member

Thanks for working on this. Any reason not to reuse our grouping code? Optimizations matter quite a lot for PooledArray/CategoricalArray, and soon for integers.

@bkamins
Member Author

bkamins commented Jan 27, 2021

Optimizations matter quite a lot for PooledArray/CategoricalArray, and soon for integers.

It was not clear to me how to take advantage of these optimizations. The reason is that even if e.g. df1 and df2 each have a column that is a PooledArray, these are two distinct arrays, each with its own pool. I can do the grouping faster, but how do I then match the second data frame to the first one?

@bkamins
Member Author

bkamins commented Jan 27, 2021

I have added a fast join path for sorted tables. Unfortunately it cannot be used with CategoricalVector.
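
A fast path for sorted inputs is typically some form of merge join. The following is a minimal sketch of that idea on plain key vectors, not the code from this PR; merge_join_indices is a hypothetical name, and the sketch assumes both inputs are sorted under isless.

```julia
# Hypothetical helper (not a DataFrames.jl function): returns the matching
# row-index pairs for two key vectors that are already sorted.
function merge_join_indices(left::AbstractVector, right::AbstractVector)
    issorted(left) && issorted(right) ||
        throw(ArgumentError("both key vectors must be sorted"))
    left_ix, right_ix = Int[], Int[]
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if isless(left[i], right[j])
            i += 1
        elseif isless(right[j], left[i])
            j += 1
        else
            # Equal keys: find the run of this key on both sides.
            key = left[i]
            i_end = i
            while i_end < length(left) && isequal(left[i_end + 1], key)
                i_end += 1
            end
            j_end = j
            while j_end < length(right) && isequal(right[j_end + 1], key)
                j_end += 1
            end
            # Emit the cross product of the two runs.
            for a in i:i_end, b in j:j_end
                push!(left_ix, a)
                push!(right_ix, b)
            end
            i, j = i_end + 1, j_end + 1
        end
    end
    return left_ix, right_ix
end

left_ix, right_ix = merge_join_indices(["a", "b", "b", "d"], ["b", "c", "d", "d"])
# left_ix == [2, 3, 4, 4], right_ix == [1, 1, 3, 4]
```

Both inputs are scanned once, so no hash table is built.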

@bkamins
Member Author

bkamins commented Jan 27, 2021

OK. So here are the timings. The conclusion is:

  1. it would be good if someone ran the same benchmarks on a different machine, as the results are mixed
  2. it is definitely worth checking whether the data is sorted and, if so, using a fast path
  3. the table ordering issue in the current implementation should definitely be fixed, and the fix is easy
  4. the current implementation (on main) and the proposed one (this PR) have comparable performance when the data is not sorted and the correct order of data frames is used (so a review of the PR from the performance perspective would be welcome); the one exception is the case of many unique groups that are not sorted, where the PR is also significantly faster
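
One way to make timing independent of argument order (point 3) is to always build the lookup on the shorter key vector and probe with the longer one. The following is a minimal sketch on plain vectors, assuming that this is what fixing the ordering issue amounts to; probe_join and order_insensitive_join are hypothetical names, not DataFrames.jl internals.

```julia
# Hypothetical helper: hash `small`, then probe it with every key of `large`.
function probe_join(small::AbstractVector, large::AbstractVector)
    lookup = Dict{eltype(small), Vector{Int}}()
    for (i, key) in enumerate(small)
        push!(get!(() -> Int[], lookup, key), i)
    end
    small_ix, large_ix = Int[], Int[]
    for (j, key) in enumerate(large)
        rows = get(lookup, key, nothing)
        rows === nothing && continue
        for i in rows
            push!(small_ix, i)
            push!(large_ix, j)
        end
    end
    return small_ix, large_ix
end

# Returns (left_ix, right_ix) no matter which side ended up being hashed.
function order_insensitive_join(left::AbstractVector, right::AbstractVector)
    if length(left) <= length(right)
        return probe_join(left, right)
    else
        right_ix, left_ix = probe_join(right, left)
        return left_ix, right_ix
    end
end

l = [3, 1, 2]
r = [2, 3, 3, 5, 1]
a, b = order_insensitive_join(l, r)    # indices into l and r
b2, a2 = order_insensitive_join(r, l)  # same pairs, arguments swapped
```

In this sketch the output rows follow the order of the longer table, which may differ from the row order a real join guarantees.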

@nalimilan - what do you think we should do?

smaller data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.364543 seconds (183 allocations: 707.847 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 23.762778 seconds (183 allocations: 244.524 MiB, 1.80% gc time)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.568400 seconds (183 allocations: 707.847 MiB, 34.00% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 27.465237 seconds (183 allocations: 244.524 MiB, 4.53% gc time)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.779967 seconds (183 allocations: 651.651 MiB, 32.21% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.737591 seconds (183 allocations: 238.863 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  6.351662 seconds (185 allocations: 667.052 MiB, 41.23% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  4.315376 seconds (185 allocations: 254.227 MiB)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.262994 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.259001 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.410780 seconds (241 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.433197 seconds (241 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.307506 seconds (149 allocations: 22.109 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.313784 seconds (149 allocations: 22.109 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  5.411631 seconds (1.27 M allocations: 146.839 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  6.639238 seconds (1.27 M allocations: 146.839 MiB, 20.04% gc time)

bigger data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 38.194830 seconds (183 allocations: 6.260 GiB, 37.01% gc time)

julia> @time innerjoin(df2, df1, on=:id);
414.828265 seconds (183 allocations: 1.580 GiB, 0.81% gc time)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
208.718680 seconds (183 allocations: 6.260 GiB, 75.74% gc time)

julia> @time innerjoin(df2, df1, on=:id);
388.470967 seconds (183 allocations: 1.580 GiB, 7.89% gc time)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 26.541641 seconds (185 allocations: 5.727 GiB, 48.55% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 19.075522 seconds (185 allocations: 1.589 GiB, 16.08% gc time)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
145.480645 seconds (183 allocations: 5.712 GiB, 65.00% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 61.259306 seconds (183 allocations: 1.574 GiB, 33.79% gc time)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  2.557273 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  2.511205 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 42.093122 seconds (241 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 41.584282 seconds (241 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  2.956578 seconds (149 allocations: 22.132 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  2.964904 seconds (149 allocations: 22.132 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 69.561194 seconds (1.27 M allocations: 131.589 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 69.840764 seconds (1.27 M allocations: 131.589 MiB)

@nalimilan
Member

I have added a fast join path for sorted tables. Unfortunately it cannot be used with CategoricalVector.

Yeah, the comparison between pools is still an annoying problem and I haven't tried implementing the global table to fix that. At least it shouldn't be hard to check whether refpools are equal, and if so you can work directly on the refarrays. That should cover most cases, and you'll get efficient PooledArray support too. It would probably be possible to check whether one refpool is an ordered subset of the other to do clever things, but that can be left for later.
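
To illustrate the refpool idea, here is a minimal sketch using a toy pooled type. ToyPooled and refs_in_pool_of are hypothetical and do not reflect the PooledArrays.jl or DataAPI.jl interfaces. When the pools are equal the integer codes can be compared directly; otherwise the sketch translates one set of codes into the other pool's code space, which is one possible answer to the matching question raised earlier in the thread.

```julia
# Toy stand-in for a pooled vector (hypothetical, for illustration only).
struct ToyPooled{T}
    refs::Vector{UInt32}   # refs[i] is an index into pool
    pool::Vector{T}        # the unique values
end

# Express the refs of `b` in the code space of `a`.
# A result of 0 means "this value does not occur in a.pool".
function refs_in_pool_of(a::ToyPooled, b::ToyPooled)
    # Equal pools: the codes are directly comparable (returned without copying).
    a.pool == b.pool && return b.refs
    position = Dict(v => UInt32(i) for (i, v) in enumerate(a.pool))
    remap = [get(position, v, UInt32(0)) for v in b.pool]
    return remap[b.refs]
end

a = ToyPooled(UInt32[1, 2, 1], ["x", "y"])
b = ToyPooled(UInt32[1, 2, 3], ["y", "z", "x"])
translated = refs_in_pool_of(a, b)   # codes of b expressed via a.pool
```

After the translation, a join could work on integer codes only, and the remapping cost depends on the pool size rather than the number of rows.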

@nalimilan - what do you think we should do?

What's your question exactly? If I understand correctly, this PR is always as fast as main or faster, so I have nothing to object. :-)

Something which would be worth benchmarking is joining on multiple columns. I think it's the case where hashing columns one by one like hashrows_cols! does make the biggest difference.

Here's another run of your benchmarks (on a Xeon 4114 at 2.20GHz):

smaller data

main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.712898 seconds (2.43 M allocations: 843.429 MiB, 23.98% gc time, 40.77% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  1.980586 seconds (194 allocations: 252.524 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.824913 seconds (194 allocations: 707.847 MiB, 45.99% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  3.083552 seconds (194 allocations: 252.524 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.624032 seconds (196 allocations: 667.021 MiB, 15.88% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.563919 seconds (196 allocations: 262.201 MiB, 9.95% gc time)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  5.085381 seconds (196 allocations: 666.978 MiB, 37.59% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  2.794582 seconds (196 allocations: 262.180 MiB)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.964033 seconds (1.70 M allocations: 122.239 MiB, 19.72% gc time, 65.16% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  0.357172 seconds (154 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  3.679063 seconds (257.89 k allocations: 105.835 MiB, 5.86% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  3.884744 seconds (246 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.404584 seconds (154 allocations: 22.133 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.409299 seconds (154 allocations: 22.133 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  5.682937 seconds (1.27 M allocations: 146.871 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  6.104373 seconds (1.27 M allocations: 146.871 MiB)

bigger data

main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 46.291461 seconds (194 allocations: 6.260 GiB, 42.79% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 28.256333 seconds (194 allocations: 1.588 GiB, 21.50% gc time)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
137.688439 seconds (194 allocations: 6.260 GiB, 76.09% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 61.029408 seconds (194 allocations: 1.588 GiB, 35.07% gc time)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 27.746145 seconds (194 allocations: 5.712 GiB, 54.92% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 16.150328 seconds (194 allocations: 1.582 GiB, 20.20% gc time)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
116.331639 seconds (196 allocations: 5.727 GiB, 65.69% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 52.525720 seconds (196 allocations: 1.597 GiB, 30.45% gc time)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  4.144959 seconds (1.70 M allocations: 122.124 MiB, 32.25% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  2.769002 seconds (154 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 41.557779 seconds (257.89 k allocations: 105.835 MiB, 0.58% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
 40.998080 seconds (246 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  3.731307 seconds (154 allocations: 22.113 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  3.816319 seconds (154 allocations: 22.113 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 39.876243 seconds (1.27 M allocations: 146.741 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 39.830833 seconds (1.27 M allocations: 146.741 MiB)

@bkamins
Member Author

bkamins commented Jan 27, 2021

Here are benchmarks on integer columns. In this comparison the PR looks much better (so the String case is the hard one):

smaller data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.637557 seconds (183 allocations: 707.847 MiB, 15.81% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 18.728374 seconds (183 allocations: 244.524 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.874104 seconds (183 allocations: 707.847 MiB, 5.16% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 22.334269 seconds (183 allocations: 244.524 MiB, 0.14% gc time)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.040792 seconds (183 allocations: 651.651 MiB, 1.98% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.436125 seconds (183 allocations: 238.863 MiB, 7.20% gc time)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  2.874287 seconds (185 allocations: 667.052 MiB, 1.03% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  2.081790 seconds (185 allocations: 254.227 MiB)

this PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.028745 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.027683 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.709983 seconds (245 allocations: 90.813 MiB, 1.41% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  0.701188 seconds (245 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.043311 seconds (149 allocations: 22.109 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.042666 seconds (149 allocations: 22.109 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  1.104376 seconds (1.27 M allocations: 146.839 MiB, 14.43% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.098397 seconds (1.27 M allocations: 146.839 MiB, 12.16% gc time)

larger data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 16.249414 seconds (183 allocations: 6.260 GiB, 1.73% gc time)

julia> @time innerjoin(df2, df1, on=:id);
167.121607 seconds (183 allocations: 1.580 GiB, 0.14% gc time)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 17.036872 seconds (183 allocations: 6.260 GiB, 1.85% gc time)

julia> @time innerjoin(df2, df1, on=:id);
171.352191 seconds (183 allocations: 1.580 GiB, 0.14% gc time)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  9.675349 seconds (185 allocations: 5.727 GiB, 3.15% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  9.937535 seconds (185 allocations: 1.589 GiB, 2.17% gc time)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 24.858718 seconds (183 allocations: 5.712 GiB, 0.41% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 14.484201 seconds (183 allocations: 1.574 GiB, 1.55% gc time)

this PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.107601 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.109973 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  4.833485 seconds (245 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.871002 seconds (245 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.118253 seconds (149 allocations: 22.132 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.121240 seconds (149 allocations: 22.132 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
  4.322843 seconds (1.27 M allocations: 131.589 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.342500 seconds (1.27 M allocations: 131.589 MiB)

@bkamins
Member Author

bkamins commented Jan 27, 2021

and if so you can work directly on the refarrays. That should cover most cases

I do not think this would be a common case as most likely you are joining columns coming from different sources.

this PR is always as fast as main or faster

Not exactly - in some cases it is a bit slower as it allocates a lot more than the old one.

Something which would be worth benchmarking is joining on multiple columns.

I will run such benchmarks and report the results

@bkamins
Member Author

bkamins commented Jan 27, 2021

Here are tests for two columns (smaller data, as this is more problematic).

In general the results are not bad. What I do in the PR is materialize a vector of tuples from the tuple of key-column vectors, and then things are fast. Creating this vector costs memory (this is bad), but it is relatively fast.
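
As a small illustration of turning a tuple of vectors into a vector of tuples, here is a minimal sketch using zip and collect from Base. The column names mirror the benchmark below, but the construction is an assumption and may differ from what the PR does.

```julia
id1 = ["a", "b", "c"]
id2 = [1, 2, 3]

cols = (id1, id2)                   # tuple of vectors: one entry per key column
row_keys = collect(zip(cols...))    # vector of tuples: one entry per row

# Each row key is an ordinary tuple, so it can be hashed or compared directly.
lookup = Dict(key => i for (i, key) in enumerate(row_keys))
already_sorted = issorted(row_keys) # tuples compare lexicographically
```

The collected vector has a concrete element type (here Tuple{String, Int}), which makes later hashing and comparison fast, at the price of one extra allocation proportional to the number of rows.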

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id1 = sort!(string.(1:10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(string.(1:10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  1.643699 seconds (189 allocations: 677.330 MiB, 17.62% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 20.949350 seconds (189 allocations: 214.007 MiB)

julia> df1 = DataFrame(id1 = shuffle!(string.(1:10^6)));

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = shuffle!(string.(1:10^7)));

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  4.515834 seconds (198 allocations: 715.477 MiB, 29.89% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 24.911806 seconds (198 allocations: 252.154 MiB, 4.40% gc time)

julia> df1 = DataFrame(id1 = sort!(rand(string.(1:10^6), 10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(rand(string.(1:10^7), 10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  1.730993 seconds (189 allocations: 677.330 MiB, 20.93% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 20.352023 seconds (189 allocations: 214.007 MiB)

julia> df1 = DataFrame(id1 = rand(string.(1:10^6), 10^6), id2 = 1:10^6);

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = rand(string.(1:10^7), 10^7), id2 = 1:10^7);

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  6.393938 seconds (200 allocations: 674.696 MiB, 36.35% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  4.530607 seconds (200 allocations: 261.871 MiB)

this PR

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id1 = sort!(string.(1:10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(string.(1:10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.419206 seconds (167 allocations: 183.118 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.650134 seconds (167 allocations: 183.118 MiB, 38.94% gc time)

julia> df1 = DataFrame(id1 = shuffle!(string.(1:10^6)));

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = shuffle!(string.(1:10^7)));

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  6.374714 seconds (265 allocations: 296.956 MiB, 27.80% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  4.678481 seconds (265 allocations: 296.956 MiB)

julia> df1 = DataFrame(id1 = sort!(rand(string.(1:10^6), 10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(rand(string.(1:10^7), 10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.882602 seconds (167 allocations: 183.118 MiB, 44.24% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.437506 seconds (167 allocations: 183.118 MiB)

julia> df1 = DataFrame(id1 = rand(string.(1:10^6), 10^6), id2 = 1:10^6);

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = rand(string.(1:10^7), 10^7), id2 = 1:10^7);

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  5.791302 seconds (1.27 M allocations: 337.006 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  7.345807 seconds (1.27 M allocations: 337.006 MiB, 22.06% gc time)

@bkamins
Member Author

bkamins commented Jan 28, 2021

@nalimilan - thinking about this, the crucial tension is:

  1. if we do not have duplicates in shorter column - I am pretty confident that what we do in the PR is better.
  2. if we have a few groups in shorter column - I am pretty confident that what we do in the PR is better (as we allocate only a few vectors that are long and in contiguous blocks of memory)
  3. if we have a lot of groups, in shorter column, but have some duplicates - here the real tension is; our approach will allocate a lot of small vectors, while the groupby code allocates just four vectors (for the groups, permutation, starts and ends) -> in such cases this will be more efficient.

So there is a U shaped relationship between the PR and main: the PR is better in the extremes (no duplicates or very few groups), and groupby code should be better in the case of very many groups but with at least one duplicate.
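
To make the "four vectors" layout concrete, here is a minimal sketch of a groupby-style index (groups, permutation, starts, ends) built with a counting pass. group_layout is a hypothetical name, the sketch is not the DataFrames.jl grouping code, and it uses two extra scratch vectors for simplicity.

```julia
# Hypothetical helper: a groupby-style layout using a few long vectors
# instead of one small vector per group.
function group_layout(key_col::AbstractVector)
    ids = Dict{eltype(key_col), Int}()
    groups = Vector{Int}(undef, length(key_col))
    for (i, key) in enumerate(key_col)
        # Group ids are assigned in order of first appearance.
        groups[i] = get!(ids, key, length(ids) + 1)
    end
    counts = zeros(Int, length(ids))
    for g in groups
        counts[g] += 1
    end
    ends = cumsum(counts)
    starts = ends .- counts .+ 1
    perm = Vector{Int}(undef, length(key_col))
    next = copy(starts)              # next free slot for each group
    for (i, g) in enumerate(groups)
        perm[next[g]] = i
        next[g] += 1
    end
    return groups, perm, starts, ends
end

groups, perm, starts, ends = group_layout(["b", "a", "b", "c", "a"])
rows_of_group_2 = perm[starts[2]:ends[2]]   # rows holding the key "a"
```

The number of allocations here does not depend on the number of groups, which is why this kind of layout should win in the many-groups-with-duplicates case described above.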

What I will do is investigate if using dict = Dict{Union{T, Vector{T}}, Int}() would be better than having two separate code paths. Hopefully it will be handled relatively efficiently by the compiler (in which case we will reduce the range where the PR is worse - then the worst case will be "all groups have exactly two entries").

@nalimilan
Member

I do not think this would be a common case as most likely you are joining columns coming from different sources.

Well yeah that's a special case where there's a one-to-one correspondence between tables. Maybe checking that one pool is a superset of the other is more useful.

  • if we have a lot of groups, in shorter column, but have some duplicates - here the real tension is; our approach will allocate a lot of small vectors, while the groupby code allocates just four vectors (for the groups, permutation, starts and ends) -> in such cases this will be more efficient.

Would it help to allocate a large vector and reuse it? Just a thought.

@bkamins
Member Author

bkamins commented Jan 28, 2021

Would it help to allocate a large vector and reuse it? Just a thought.

If I understand you correctly this is exactly what groupby does. Right?

@nalimilan
Member

Well the different groupby methods indeed only allocate a few large vectors. But they don't reuse them for different tasks, it's just that they don't need to allocate many small vectors. (Note that I may be misunderstanding something as I haven't looked at the PR carefully.)

@bkamins
Member Author

bkamins commented Jan 28, 2021

I have an idea how to improve things without allocating much. I will try to push to the PR and compare.

@bkamins
Member Author

bkamins commented Jan 28, 2021

I have pushed another commit - for comparison - using a strategy similar to groupby (but simpler as we do not have to handle missing groups).

@bkamins
Member Author

bkamins commented Jan 28, 2021

This commit looks uniformly much better than whatever we had. Therefore I think it is worth reviewing now from the implementation perspective, and I will write more correctness tests.

The strategy is:

  1. check if both tables are sorted; if yes use merge-join
  2. try with assumption that shorter table has unique keys
  3. if step 2 fails (i.e. we have duplicates) use an improved version of groupby code but reuse all work done by step 2
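
The three steps can be sketched as a simple dispatch on plain key vectors. This is a minimal illustration under stated assumptions: try_unique_lookup and choose_join_path are hypothetical names, the path symbols are made up, and unlike the PR this sketch does not reuse the work of step 2 when it falls back to step 3.

```julia
# Step 2 (sketch): build a key => row lookup, giving up at the first duplicate.
function try_unique_lookup(short::AbstractVector)
    lookup = Dict{eltype(short), Int}()
    sizehint!(lookup, length(short))
    for (i, key) in enumerate(short)
        haskey(lookup, key) && return nothing   # duplicate found: fall back
        lookup[key] = i
    end
    return lookup
end

# Report which of the three paths would be taken for the given key vectors.
function choose_join_path(short::AbstractVector, long::AbstractVector)
    # Step 1: both sides sorted, so a merge join applies.
    issorted(short) && issorted(long) && return :merge_join
    # Step 2: unique keys in the shorter table; step 3 otherwise.
    return try_unique_lookup(short) === nothing ? :grouped_join : :unique_key_join
end

path_sorted = choose_join_path([1, 2, 3], [1, 2, 3, 4])
path_unique = choose_join_path([3, 1, 2], [1, 2, 3, 4])
path_dups   = choose_join_path([3, 1, 3], [1, 2, 3, 4])
```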

small string:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.222383 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.225938 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.487154 seconds (241 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.481965 seconds (241 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.283161 seconds (149 allocations: 22.109 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.289440 seconds (149 allocations: 22.109 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  1.900495 seconds (205 allocations: 76.959 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  1.893870 seconds (205 allocations: 76.959 MiB)

large string:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  2.072145 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  2.121213 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 42.936439 seconds (241 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 42.488033 seconds (241 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  2.251284 seconds (149 allocations: 22.132 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  2.280764 seconds (149 allocations: 22.132 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 25.902032 seconds (205 allocations: 76.959 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 25.711305 seconds (205 allocations: 76.959 MiB)

small int:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.036820 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.036265 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.732560 seconds (245 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.726319 seconds (245 allocations: 90.813 MiB, 1.28% gc time)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.036151 seconds (149 allocations: 22.109 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.035766 seconds (149 allocations: 22.109 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  0.081042 seconds (209 allocations: 76.959 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.096436 seconds (209 allocations: 76.959 MiB, 7.97% gc time)

large int:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.109040 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.110280 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  4.917345 seconds (245 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  5.140699 seconds (245 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.118604 seconds (149 allocations: 22.132 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.120068 seconds (149 allocations: 22.132 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
  0.648267 seconds (209 allocations: 76.959 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.651894 seconds (209 allocations: 76.959 MiB)

small mixed:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id1 = sort!(string.(1:10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(string.(1:10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.490610 seconds (167 allocations: 183.118 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.442510 seconds (167 allocations: 183.118 MiB)

julia> df1 = DataFrame(id1 = shuffle!(string.(1:10^6)));

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = shuffle!(string.(1:10^7)));

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  6.871464 seconds (265 allocations: 296.956 MiB, 28.48% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  4.842933 seconds (265 allocations: 296.956 MiB)

julia> df1 = DataFrame(id1 = sort!(rand(string.(1:10^6), 10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(rand(string.(1:10^7), 10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.798640 seconds (167 allocations: 183.118 MiB, 34.24% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.506351 seconds (167 allocations: 183.118 MiB)

julia> df1 = DataFrame(id1 = rand(string.(1:10^6), 10^6), id2 = 1:10^6);

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = rand(string.(1:10^7), 10^7), id2 = 1:10^7);

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  2.025413 seconds (228 allocations: 259.479 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  2.057292 seconds (228 allocations: 259.479 MiB)
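
The sorted inputs in the timings above are consistently faster than the shuffled ones, and earlier in this thread the author mentions adding a fast join path for sorted tables. The PR's actual implementation is not shown here, so the following is only a minimal sketch of the general merge-join idea on sorted integer keys. The function name `sorted_match_rows` is hypothetical and is not part of DataFrames.jl.

```julia
# Hypothetical helper (not from DataFrames.jl): inner-join row matching for
# two key vectors that are already sorted, using a single merge pass.
function sorted_match_rows(left::Vector{Int}, right::Vector{Int})
    @assert issorted(left) && issorted(right)
    left_ix, right_ix = Int[], Int[]
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if left[i] < right[j]
            i += 1
        elseif left[i] > right[j]
            j += 1
        else
            key = left[i]
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < length(left) && left[i_end + 1] == key
                i_end += 1
            end
            j_end = j
            while j_end < length(right) && right[j_end + 1] == key
                j_end += 1
            end
            # Emit the cross product of the two runs.
            for li in i:i_end, rj in j:j_end
                push!(left_ix, li)
                push!(right_ix, rj)
            end
            i, j = i_end + 1, j_end + 1
        end
    end
    return left_ix, right_ix
end

left_ix, right_ix = sorted_match_rows([1, 2, 2, 5], [2, 2, 3, 5])
```

No hash table is built in this sketch, which is one plausible reason a sorted path can allocate less and run faster than a hash-based one.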

@bkamins
Member Author

bkamins commented Jan 29, 2021

When we finalize innerjoin, all the other joins will be very cheap to add based on the innerjoin code. I already have an idea how to finalize this, but first I would like to make sure we are OK with this implementation (today I will write tests for the new functionality, as the current test coverage of joins is not very good).

@bkamins bkamins marked this pull request as ready for review January 29, 2021 14:00
@bkamins
Member Author

bkamins commented Jan 29, 2021

I have added tests and NEWS.md (@andyferris - in the end I use a much more complex algorithm than the one in SplitApplyCombine.jl). This should be ready for a review.

@bkamins
Member Author

bkamins commented Jan 29, 2021

@nalimilan - the only thing left to add is probably a special path for PooledArrays.jl and CategoricalArrays.jl (but it should not significantly change the logic I have now). I will try to work on it this weekend.

@bkamins
Member Author

bkamins commented Jan 29, 2021

@nalimilan Here are timings for PooledArray. The conclusion is that we are better overall, but it should be possible to do much better here than we do now. The question is whether we should do "pool merging" (which can also be expensive) or take some other approach. In particular, it would be useful to have invrefpool in DataAPI.jl along with refpool.

current main:

julia> using Random, DataFrames, PooledArrays

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = PooledArray(sort!(string.(1:10^6))));

julia> df2 = DataFrame(id = PooledArray(sort!(string.(1:10^7))));

julia> @time innerjoin(df1, df2, on=:id);
  3.023138 seconds (259 allocations: 762.870 MiB, 37.87% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 20.032685 seconds (259 allocations: 299.547 MiB)

julia> df1 = DataFrame(id = PooledArray((shuffle!(string.(1:10^6)))));

julia> df2 = DataFrame(id = PooledArray((shuffle!(string.(1:10^7)))));

julia> @time innerjoin(df1, df2, on=:id);
  5.601759 seconds (259 allocations: 762.870 MiB, 30.31% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 25.203538 seconds (259 allocations: 299.547 MiB, 3.24% gc time)

julia> df1 = DataFrame(id = PooledArray((sort!(rand(string.(1:10^6), 10^6)))));

julia> df2 = DataFrame(id = PooledArray((sort!(rand(string.(1:10^7), 10^7)))));

julia> @time innerjoin(df1, df2, on=:id);
  2.053148 seconds (252 allocations: 676.680 MiB, 25.71% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.849917 seconds (252 allocations: 263.892 MiB)

julia> df1 = DataFrame(id = PooledArray((rand(string.(1:10^6), 10^6))));

julia> df2 = DataFrame(id = PooledArray((rand(string.(1:10^7), 10^7))));

julia> @time innerjoin(df1, df2, on=:id);
  4.973785 seconds (254 allocations: 692.068 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  6.831288 seconds (254 allocations: 279.243 MiB, 19.35% gc time)

this PR:

julia> using Random, DataFrames, PooledArrays

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = PooledArray(sort!(string.(1:10^6))));

julia> df2 = DataFrame(id = PooledArray(sort!(string.(1:10^7))));

julia> @time innerjoin(df1, df2, on=:id);
  0.277155 seconds (160 allocations: 52.715 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  1.334287 seconds (160 allocations: 303.380 MiB, 62.15% gc time)

julia> df1 = DataFrame(id = PooledArray((shuffle!(string.(1:10^6)))));

julia> df2 = DataFrame(id = PooledArray((shuffle!(string.(1:10^7)))));

julia> @time innerjoin(df1, df2, on=:id);
  4.865085 seconds (252 allocations: 120.628 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  5.070394 seconds (252 allocations: 371.292 MiB)

julia> df1 = DataFrame(id = PooledArray((sort!(rand(string.(1:10^6), 10^6)))));

julia> df2 = DataFrame(id = PooledArray((sort!(rand(string.(1:10^7), 10^7)))));

julia> @time innerjoin(df1, df2, on=:id);
  0.327582 seconds (160 allocations: 36.904 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.510882 seconds (160 allocations: 275.302 MiB)

julia> df1 = DataFrame(id = PooledArray((rand(string.(1:10^6), 10^6))));

julia> df2 = DataFrame(id = PooledArray((rand(string.(1:10^7), 10^7))));

julia> @time innerjoin(df1, df2, on=:id);
  6.056494 seconds (219 allocations: 108.253 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  6.255971 seconds (219 allocations: 346.669 MiB)

bkamins added a commit to JuliaData/DataAPI.jl that referenced this pull request Jan 29, 2021
Having a generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612.

Consider a short table and a long table joined on some column. To be fast, we need to map the values of the short table key to the ref values of the long table key. This allows two things for `innerjoin`:
1. we can immediately drop values from the short table that are not present in the long table.
2. later we can do the join on integer columns, which is much faster than joining on e.g. a string column.

Also, since only the short table is mapped, this operation should be fast.

In particular, if the short table defines `refarray`, it is particularly fast, as we only need to map the reference values.

For CategoricalArrays.jl and PooledArrays.jl `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel.

I am not sure what would have to be defined in Arrow.jl.
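
The two points in the commit message above can be illustrated with a minimal sketch in plain Julia. It does not call DataAPI.jl, PooledArrays.jl, or CategoricalArrays.jl; the variables `long_pool`, `long_refs`, and `long_invpool` are hypothetical stand-ins for a pooled column, its reference codes, and the inverted pool `Dict` that `invrefpool` is described as wrapping.

```julia
# Hypothetical stand-in for a pooled column of the long table: `long_pool`
# holds the unique values, `long_refs` holds each row's index into the pool.
long_pool = ["a", "b", "c", "d"]
long_refs = [1, 2, 2, 4, 3, 1]          # the column is a, b, b, d, c, a

# Inverted pool: value => reference code.
long_invpool = Dict(v => i for (i, v) in enumerate(long_pool))

short_keys = ["b", "x", "d", "b"]

# Map each short-table key to the long table's reference code;
# `nothing` is the sentinel for "not present in the long table".
short_refs = [get(long_invpool, k, nothing) for k in short_keys]

# 1. Rows whose key is absent from the long table are dropped immediately.
short_rows = findall(!isnothing, short_refs)

# 2. The remaining work is a join on integer codes, not on strings.
rows_by_ref = Dict{Int, Vector{Int}}()
for (row, r) in enumerate(long_refs)
    push!(get!(() -> Int[], rows_by_ref, r), row)
end

matches = Tuple{Int, Int}[]              # (short row, long row) pairs
for srow in short_rows
    for lrow in get(rows_by_ref, short_refs[srow], Int[])
        push!(matches, (srow, lrow))
    end
end
```

Only the short table's keys are looked up by value here; every long-table row is handled through its integer code.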
@pdeffebach
Contributor

I have prepared a .zip with an MWE that contains a `dev`-ed version of DataFrames. I sent it to Jacob last night since I think the error is from CSV.

I should have also sent it to you both. Any ideas on how best to share it?

@bkamins
Member Author

bkamins commented Feb 10, 2021

@nalimilan - I have added sizehint! annotations that improve performance a bit (but I need to sort out the problem with my benchmarking code before merging).
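
For readers unfamiliar with `sizehint!`: it asks a collection to reserve capacity up front so that repeated `push!` or insert calls trigger fewer reallocations. Where exactly the PR applies it is not shown in this thread, so the following is only a sketch under that assumption; the helper `match_rows` is hypothetical.

```julia
# Hypothetical helper (not from DataFrames.jl): collect the matching row
# pairs of an inner join on integer keys, reserving capacity in advance.
function match_rows(left::Vector{Int}, right::Vector{Int})
    lookup = Dict{Int, Vector{Int}}()
    sizehint!(lookup, length(right))     # at most one entry per right row
    for (row, key) in enumerate(right)
        push!(get!(() -> Int[], lookup, key), row)
    end

    left_ix, right_ix = Int[], Int[]
    # A capacity guess only: the true result may be shorter or longer.
    sizehint!(left_ix, length(left))
    sizehint!(right_ix, length(left))
    for (lrow, key) in enumerate(left)
        for rrow in get(lookup, key, ())
            push!(left_ix, lrow)
            push!(right_ix, rrow)
        end
    end
    return left_ix, right_ix
end

left_ix, right_ix = match_rows([3, 1, 7], [1, 3, 3, 5])
```

The hint does not change the result; it only affects how often the underlying storage has to grow.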

@ronisbr
Member

ronisbr commented Feb 10, 2021

> I should have also sent it to you both. Any ideas on how best to share it?

Can you please send it to my email? It is my GitHub user @ gmail.

@bkamins
Member Author

bkamins commented Feb 10, 2021

I have checked the code on:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

and unfortunately (as it would be better if we could reproduce it) it does not error.

@ronisbr
Member

ronisbr commented Feb 10, 2021

I'm sorry, it was my fault 😊

The bug @pdeffebach discovered was a very interesting one involving the display size, the alignment of floating-point numbers, and the cropping algorithm. The fix, fortunately, is very simple. I will release a new version 0.11.1 with this fix in the next few minutes.

@bkamins
Member Author

bkamins commented Feb 10, 2021

@pdeffebach + @ronisbr - thank you for making the JuliaData ecosystem better and better every day!

@ronisbr
Member

ronisbr commented Feb 10, 2021

> @pdeffebach + @ronisbr - thank you for making the JuliaData ecosystem better and better every day!

Thanks! And sorry for this bug. It was really an important case that I totally missed when creating the alignment algorithm.

@pdeffebach PrettyTables.jl v0.11.1 was just released. Can you please try again to see if this bug is fixed?

@bkamins
Member Author

bkamins commented Feb 10, 2021

@nalimilan - to keep track of the GC issue I am uploading the files here (as Discourse disallows it).

Example code:
example.txt

Timing results:
timing.txt

@pdeffebach
Contributor

I really thought this was a PooledArrays or SentinelArrays issue.

However @ronisbr's updated version of PrettyTables made this work!

@bkamins
Member Author

bkamins commented Feb 11, 2021

@nalimilan - I have benchmarked the code thoroughly and it is consistently no worse. I have pushed a design of the benchmarking code that most of the time does not lead to horrendous GC (but sometimes it still does, especially when working with CategoricalArrays.jl).
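
The benchmarking code itself is only attached as files and is not visible in this thread. As an illustration of one common way to reduce GC noise in timings like these, and not a description of what the PR author did, a run can be preceded by an explicit collection and repeated, keeping the best time. The helper `time_after_gc` is hypothetical.

```julia
# Hypothetical helper: time `f` several times, forcing a full garbage
# collection before each run so earlier allocations do not distort it.
function time_after_gc(f; runs = 3)
    best = Inf
    for _ in 1:runs
        GC.gc()                      # collect garbage left by previous work
        t = @elapsed f()             # wall-clock seconds for one call
        best = min(best, t)
    end
    return best
end

best_seconds = time_after_gc(() -> sum(rand(10^6)))
```

Taking the minimum over a few runs also hides compilation time, which only the first call pays.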

Also, I have experimented with not using internal functions in commit a3de1c2. It reduces performance by ~5%. I have reverted this commit, but maybe we can sacrifice this 5% for the sake of cleaner code.

Apart from this decision this PR is done.

@nalimilan
Member

5% sounds acceptable if that makes the code less likely to break in the future.

@bkamins
Member Author

bkamins commented Feb 12, 2021

@nalimilan - I have reviewed everything again and pushed some small cleanups. Could you please have a look at the PR for the last time before I merge? Thank you!

@bkamins bkamins merged commit 726b4e4 into JuliaData:main Feb 13, 2021
@bkamins bkamins deleted the new_faster_innerjoin branch February 13, 2021 11:43
@bkamins
Member Author

bkamins commented Feb 13, 2021

Thank you!

The next two PRs are:

  1. faster leftjoin, rightjoin and outerjoin
  2. faster semijoin and antijoin

Labels
non-breaking The proposed change is not breaking performance priority