implement faster innerjoin #2612
Conversation
Thanks for working on this. Any reason not to reuse our grouping code? Optimizations matter quite a lot for …
It was not clear to me how to take advantage of these optimizations. The reason is that even if, e.g., …
I have added a fast join path for sorted tables. Unfortunately it cannot be used with …
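For context on what a sorted fast path buys: when both key columns are sorted, matches can be found in a single forward pass instead of via a hash table. Below is a minimal, self-contained sketch of that merge-join idea; the function name `sorted_innerjoin_indices` and all details are illustrative, not the code in this PR.

```julia
# Hedged sketch of a merge-based inner join on two *sorted* key vectors.
# Returns the pairs of indices (i, j) with isequal(left[i], right[j]).
function sorted_innerjoin_indices(left::AbstractVector, right::AbstractVector)
    left_ixs, right_ixs = Int[], Int[]
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if isless(left[i], right[j])
            i += 1
        elseif isless(right[j], left[i])
            j += 1
        else
            # equal keys: find the end of the duplicate run on each side
            i2 = i
            while i2 < length(left) && isequal(left[i2 + 1], left[i])
                i2 += 1
            end
            j2 = j
            while j2 < length(right) && isequal(right[j2 + 1], right[j])
                j2 += 1
            end
            # emit the cross product of the two runs of equal keys
            for ii in i:i2, jj in j:j2
                push!(left_ixs, ii)
                push!(right_ixs, jj)
            end
            i, j = i2 + 1, j2 + 1
        end
    end
    return left_ixs, right_ixs
end

sorted_innerjoin_indices(["a", "b", "b", "d"], ["b", "c", "d"])  # ([2, 3, 4], [1, 1, 3])
```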
OK. So here are the timings. The conclusion is:
@nalimilan - what do you think we should do?

smaller data

current main:

PR:

bigger data

current main:

PR:
Yeah, the comparison between pools is still an annoying problem and I haven't tried implementing the global table to fix that. At least it shouldn't be hard to check whether …
What's your question exactly? If I understand correctly, this PR is always as fast as main or faster, so I have nothing to object to. :-) Something which would be worth benchmarking is joining on multiple columns. I think it's the case where hashing columns one by one like … Here's another run of your benchmarks (on a Xeon 4114 at 2.20GHz):

smaller data

main:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
4.712898 seconds (2.43 M allocations: 843.429 MiB, 23.98% gc time, 40.77% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
1.980586 seconds (194 allocations: 252.524 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
4.824913 seconds (194 allocations: 707.847 MiB, 45.99% gc time)
julia> @time innerjoin(df2, df1, on=:id);
3.083552 seconds (194 allocations: 252.524 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));
julia> @time innerjoin(df1, df2, on=:id);
1.624032 seconds (196 allocations: 667.021 MiB, 15.88% gc time)
julia> @time innerjoin(df2, df1, on=:id);
1.563919 seconds (196 allocations: 262.201 MiB, 9.95% gc time)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));
julia> @time innerjoin(df1, df2, on=:id);
5.085381 seconds (196 allocations: 666.978 MiB, 37.59% gc time)
julia> @time innerjoin(df2, df1, on=:id);
2.794582 seconds (196 allocations: 262.180 MiB)

PR:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
1.964033 seconds (1.70 M allocations: 122.239 MiB, 19.72% gc time, 65.16% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
0.357172 seconds (154 allocations: 22.900 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));
julia> @time innerjoin(df1, df2, on=:id);
3.679063 seconds (257.89 k allocations: 105.835 MiB, 5.86% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
3.884744 seconds (246 allocations: 90.813 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));
julia> @time innerjoin(df1, df2, on=:id);
0.404584 seconds (154 allocations: 22.133 MiB)
julia> @time innerjoin(df2, df1, on=:id);
0.409299 seconds (154 allocations: 22.133 MiB)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));
julia> @time innerjoin(df1, df2, on=:id);
5.682937 seconds (1.27 M allocations: 146.871 MiB)
julia> @time innerjoin(df2, df1, on=:id);
6.104373 seconds (1.27 M allocations: 146.871 MiB)

bigger data

main:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
46.291461 seconds (194 allocations: 6.260 GiB, 42.79% gc time)
julia> @time innerjoin(df2, df1, on=:id);
28.256333 seconds (194 allocations: 1.588 GiB, 21.50% gc time)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
137.688439 seconds (194 allocations: 6.260 GiB, 76.09% gc time)
julia> @time innerjoin(df2, df1, on=:id);
61.029408 seconds (194 allocations: 1.588 GiB, 35.07% gc time)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));
julia> @time innerjoin(df1, df2, on=:id);
27.746145 seconds (194 allocations: 5.712 GiB, 54.92% gc time)
julia> @time innerjoin(df2, df1, on=:id);
16.150328 seconds (194 allocations: 1.582 GiB, 20.20% gc time)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));
julia> @time innerjoin(df1, df2, on=:id);
116.331639 seconds (196 allocations: 5.727 GiB, 65.69% gc time)
julia> @time innerjoin(df2, df1, on=:id);
52.525720 seconds (196 allocations: 1.597 GiB, 30.45% gc time)

PR:

julia> using Random, DataFrames
julia> Random.seed!(1234);
julia> df1 = DataFrame(id = sort!(string.(1:10^6)));
julia> df2 = DataFrame(id = sort!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
4.144959 seconds (1.70 M allocations: 122.124 MiB, 32.25% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
2.769002 seconds (154 allocations: 22.900 MiB)
julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));
julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));
julia> @time innerjoin(df1, df2, on=:id);
41.557779 seconds (257.89 k allocations: 105.835 MiB, 0.58% compilation time)
julia> @time innerjoin(df2, df1, on=:id);
40.998080 seconds (246 allocations: 90.813 MiB)
julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));
julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));
julia> @time innerjoin(df1, df2, on=:id);
3.731307 seconds (154 allocations: 22.113 MiB)
julia> @time innerjoin(df2, df1, on=:id);
3.816319 seconds (154 allocations: 22.113 MiB)
julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));
julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));
julia> @time innerjoin(df1, df2, on=:id);
39.876243 seconds (1.27 M allocations: 146.741 MiB)
julia> @time innerjoin(df2, df1, on=:id);
39.830833 seconds (1.27 M allocations: 146.741 MiB)
Here are benchmarks on integer columns. In this comparison the PR looks much better (so …).

smaller data

current main:

this PR:

larger data

current main:

this PR:
I do not think this would be a common case as most likely you are joining columns coming from different sources.
Not exactly - in some cases it is a bit slower as it allocates a lot more than the old one.
I will run such benchmarks and report the results.
Here are tests for two columns (smaller data, as this is more problematic). In general it is not bad. What I do in my PR is that I allocate a vector of tuples from a tuple of vectors, and then things are fast. The creation of this vector uses memory (which is bad), but it is relatively fast. A sketch of this idea is shown after the timings below.

current:

this PR:
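To make the "vector of tuples from a tuple of vectors" idea above concrete, here is a hedged sketch (the helper name `tuple_rows` is made up; this is not the PR's implementation):

```julia
# Turn a tuple of key columns into a single vector of row tuples so that a
# multi-column key can be hashed and compared as one value per row.
tuple_rows(cols::Tuple{Vararg{AbstractVector}}) = collect(zip(cols...))

left_keys  = (["a", "b", "c"], [1, 2, 3])
right_keys = (["b", "c", "d"], [2, 3, 4])

# This allocates a new vector (the memory cost mentioned above), but the
# resulting Vector{Tuple{String, Int64}} is fast to hash row by row.
k_left  = tuple_rows(left_keys)   # [("a", 1), ("b", 2), ("c", 3)]
k_right = tuple_rows(right_keys)  # [("b", 2), ("c", 3), ("d", 4)]
```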
@nalimilan - thinking about this, the crucial tension is:

So there is a U-shaped relationship between the PR and …

What I will do is investigate if using …
Well yeah that's a special case where there's a one-to-one correspondence between tables. Maybe checking that one pool is a superset of the other is more useful.
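As an illustration of the superset check mentioned above, a hedged sketch using the generic `DataAPI.refpool` accessor (assuming PooledArrays; `pools_compatible` is a made-up name):

```julia
using DataAPI, PooledArrays

# Check whether every level of a's pool also appears in b's pool,
# i.e. whether b's pool is a superset of a's and a's levels could be remapped.
pools_compatible(a, b) = issubset(DataAPI.refpool(a), DataAPI.refpool(b))

x = PooledArray(["a", "b", "a"])
y = PooledArray(["a", "b", "c", "b"])

pools_compatible(x, y)  # true: levels of x are a subset of levels of y
pools_compatible(y, x)  # false: "c" is not a level of x
```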
Would it help to allocate a large vector and reuse it? Just a thought.
If I understand you correctly, this is exactly what …
Well, the different …
I have an idea how to improve things without allocating much. I will try to push to the PR and compare.
I have pushed another commit - for comparison - using a strategy similar to …
This commit looks uniformly much better than whatever we had. Therefore I think it is good to have a look at it now from the implementation perspective, and I will write more correctness tests. The strategy is:

small string:

large string:

small int:

large int:

small mixed:
When we finalize …
I have added tests and NEWS.md (@andyferris - in the end I use a much more complex algorithm than the one in SplitApplyCombine.jl). This should be ready for review.
@nalimilan - the only thing left to add is probably a special path for PooledArrays.jl and CategoricalArrays.jl (but it should not significantly change the logic I have now). I will try to work on it this weekend.
@nalimilan Here are timings for …

current:

this PR:
Having generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612. Consider a short table and a long table joined on some column. In order to be fast we need to map values from the short table's key to ref values of the long table's key. This allows two things for `innerjoin`:
1. we can immediately drop values from the short table that are not present in the long table;
2. later we can do the join on integer columns, which is way faster than joining on e.g. a string column.
Also, since we do the mapping on the short table, this operation should be fast. In particular, if the short table's key defines `refarray`, it is particularly fast, as we only need to map the reference values. For CategoricalArrays.jl and PooledArrays.jl, `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel. I am not sure what would have to be defined in Arrow.jl.
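A rough illustration of the mapping described above, as a hedged sketch (assuming PooledArrays implements `DataAPI.invrefpool` as a value-to-ref lookup supporting `get`, as stated; the variable names are made up):

```julia
using DataAPI, PooledArrays

# Long table key column: pooled, so it has an inverted pool mapping value -> ref code.
long_key = PooledArray(["apple", "banana", "cherry", "banana"])
invpool  = DataAPI.invrefpool(long_key)

# Short table key column: map each value to the long table's ref code,
# using `nothing` as the sentinel for values absent from the long table.
short_key = ["banana", "kiwi", "cherry"]
mapped = [get(invpool, v, nothing) for v in short_key]
# e.g. [0x02, nothing, 0x03] (exact codes and integer type depend on the pool):
# "kiwi" can be dropped immediately, and the rest of the join can run on integers.
```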
I have a … I should have also sent it to you both. Any ideas on how best to share it?
@nalimilan - I have added …
Can you please send it to my email? It is my GitHub user @ gmail.
I have checked the code on:
and unfortunately (as it would be better if we could reproduce it) it does not error.
I'm sorry, it was my fault 😊 The bug @pdeffebach discovered was a very interesting one involving the display size, the alignment of floating-point numbers, and the cropping algorithm. The fix, fortunately, is very simple. I will release a new version.
@pdeffebach + @ronisbr - thank you for making the JuliaData ecosystem better and better every day!
Thanks! And sorry for this bug. It was really an important case that I totally missed when creating the alignment algorithm. @pdeffebach PrettyTables.jl v0.11.1 was just released. Can you please try again to see if this bug is fixed?
@nalimilan - to keep track of the GC issue I am uploading the files here (as Discourse disallows it).
Example code:
Timing results:
I really thought this was a PooledArrays or SentinelArrays issue. However, @ronisbr's updated version of PrettyTables made this work!
@nalimilan - I have benchmarked the code thoroughly and it is consistently not worse. I have pushed the design of the benchmarking code that most of the time does not lead to horrendous GC (but sometimes it still does, especially when working with CategoricalArrays.jl). Also, I have experimented with not using internal functions in commit a3de1c2. It reduces the performance by ~5%. I have reverted this commit, but maybe we can sacrifice this 5% for the sake of cleaner code. Apart from this decision, this PR is done.
5% sounds acceptable if that makes the code less likely to break in the future.
This reverts commit 07ecb0a.
@nalimilan - I have reviewed everything again and pushed some small cleanups. Could you please have a look at the PR one last time before I merge? Thank you!
Thank you! The next two PRs are:
First step towards resolving #2340.
Doing `innerjoin`, as it is the most common join and different from the others (it does not introduce `missing`s). It passes tests, but I have to do some benchmarking of it. I will post the results. When I am sure it is OK I will mark it as ready and update NEWS.md.
Of course some external tests are welcome.
CC @nalimilan @andyferris (I use the ideas from SplitApplyCombine.jl and our discussions, but decided not to introduce a dependency - I hope it is OK)
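For readers less familiar with the distinction mentioned above, a small standard DataFrames.jl example (nothing here is new API):

```julia
using DataFrames

left  = DataFrame(id = [1, 2, 3], x = ["a", "b", "c"])
right = DataFrame(id = [2, 3, 4], y = [2.0, 3.0, 4.0])

# innerjoin keeps only rows whose `id` occurs in both tables,
# so the result contains no `missing` values.
innerjoin(left, right, on = :id)   # 2×3 DataFrame (ids 2 and 3)

# By contrast, outerjoin keeps unmatched rows from both sides and fills
# the corresponding columns with `missing`.
outerjoin(left, right, on = :id)   # 4×3 DataFrame with `missing` entries
```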