use dict to cache eltype names #2750

bkamins · 2021-05-05T20:54:50Z

This change should make `show` faster in most cases (even narrow tables should improve). I have not done much performance testing, so additional checks are welcome.

Here is one benchmark.

This PR:

julia> df = DataFrame(rand(1,10^5),:auto);

julia> @time show(df)
2.086479 seconds (5.54 M allocations: 335.747 MiB, 7.02% gc time, 94.93% compilation time)

julia> @time show(df)
0.106020 seconds (501.91 k allocations: 40.995 MiB, 8.64% gc time)

julia> allowmissing!(df);

julia> @time show(df)
0.206911 seconds (622.98 k allocations: 48.133 MiB, 5.08% gc time, 52.31% compilation time)

julia> @time show(df)
0.099357 seconds (501.88 k allocations: 40.994 MiB, 9.74% gc time)

and DataFrames.jl 1.1:

julia> df = DataFrame(rand(1,10^5),:auto);

julia> @time show(df)
1.994571 seconds (5.40 M allocations: 325.074 MiB, 7.43% gc time, 91.25% compilation time)

julia> @time show(df)
0.168648 seconds (801.91 k allocations: 56.254 MiB, 11.60% gc time)

julia> allowmissing!(df);

julia> @time show(df)
108.357185 seconds (20.45 M allocations: 5.339 GiB, 0.45% gc time, 0.05% compilation time)

julia> @time show(df)
112.724588 seconds (20.40 M allocations: 5.336 GiB, 0.34% gc time)

ronisbr · 2021-05-06T12:24:41Z

LGTM! Just one question: why can't we move the lock and unlock inside the function compacttype, to avoid adding those calls everywhere compacttype is called?

bkamins · 2021-05-06T13:30:12Z

> why can't we move the lock and unlock inside the function compacttype, to avoid adding those calls everywhere compacttype is called?

The point is that locking is not done every time compacttype is called. The lock is taken only once per chain of compacttype calls.
So if you have 10^5 columns you avoid many lock and unlock calls, and a lock/unlock pair is more expensive than a dict lookup.
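
To make the trade-off concrete, here is a minimal sketch of the two locking strategies. All names in it (`NAME_CACHE`, `NAME_CACHE_LOCK`, `cached_name`, and the two `names_lock_*` functions) are hypothetical and are not taken from the DataFrames.jl source; `string(T)` merely stands in for the real name computation.

```julia
# Hypothetical names throughout; this is NOT the DataFrames.jl implementation.
const NAME_CACHE = Dict{Any, String}()
const NAME_CACHE_LOCK = ReentrantLock()

# Look up (or compute and store) the name of type T.
# Assumes the caller already holds NAME_CACHE_LOCK.
cached_name(T) = get!(() -> string(T), NAME_CACHE, T)

# Variant A: lock and unlock around every single lookup.
function names_lock_per_call(types::Vector{Any})
    return String[lock(() -> cached_name(T), NAME_CACHE_LOCK) for T in types]
end

# Variant B: take the lock once around the whole chain of lookups.
function names_lock_per_chain(types::Vector{Any})
    lock(NAME_CACHE_LOCK) do
        return String[cached_name(T) for T in types]
    end
end

# 10^5 "columns" with only two distinct element types.
types = Any[isodd(i) ? Float64 : Union{Missing, Float64} for i in 1:10^5]
names_a = names_lock_per_call(types)
names_b = names_lock_per_chain(types)
```

Both variants return the same names. Variant A performs 10^5 lock/unlock pairs, while variant B performs one, which is the saving described above.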

@nalimilan - I will wait for your approval before merging.

ronisbr · 2021-05-06T14:18:51Z

Thanks @bkamins. You are right. I did not realize that those broadcasts actually call the function many, many times.

nalimilan · 2021-05-06T14:48:19Z

Instead of taking the lock everywhere, wouldn't it be almost as fast (and simpler) to use a vectorized compacttype which allocates a temporary dict each time it's called? I imagine that the number of different types is small so it shouldn't cost much. That requires storing the result of compacttype in a vector with one entry per column, but that shouldn't be too large, right?
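
A rough sketch of that suggestion, assuming a hypothetical helper `shortname` as a stand-in for the expensive per-type string computation (it is not the real `compacttype`) and a hypothetical `batch_names` wrapper. The counter `CALLS` exists only to show how often the expensive path runs.

```julia
# Hypothetical stand-in for the expensive per-type computation;
# it simply truncates the printed type name to maxwidth characters.
const CALLS = Ref(0)
function shortname(T, maxwidth::Int)
    CALLS[] += 1
    return first(string(T), maxwidth)
end

# Vectorized version: one temporary Dict per call, no global state, no lock.
# Returns a vector with one entry per column.
function batch_names(types::Vector{Any}, maxwidth::Int)
    cache = Dict{Any, String}()   # discarded when the function returns
    return String[get!(() -> shortname(T, maxwidth), cache, T) for T in types]
end

# 10^5 columns, but only two distinct element types.
types = Any[isodd(i) ? Float64 : Union{Missing, Float64} for i in 1:10^5]
result = batch_names(types, 8)
```

With only two distinct element types, `shortname` runs twice for the 10^5 columns, and every other column is served by a dict lookup. The cost is the temporary result vector with one entry per column.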

This reverts commit 6685c73.

bkamins · 2021-05-06T16:48:39Z

@nalimilan I have removed the global state as you requested and improved inference a bit, so time to first table is shorter even in simple cases, despite the more complex logic. Because of this I am resolving your comments, as they no longer apply. Timings:

Test 1

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(a=1); @time show(x)"
1.627265 seconds (4.10 M allocations: 240.722 MiB, 3.88% gc time, 99.93% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(a=1); @time show(x)"
1.695662 seconds (4.30 M allocations: 253.672 MiB, 4.92% gc time, 99.94% compilation time)

Test 2

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(); @time show(x)"
0×0 DataFrame  1.403435 seconds (3.59 M allocations: 211.961 MiB, 5.91% gc time, 99.97% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(); @time show(x)"
0×0 DataFrame  1.574054 seconds (4.13 M allocations: 243.455 MiB, 6.07% gc time, 99.98% compilation time)

Test 3

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(rand(Int, 1, 10^5), :auto); @time show(x)"
1.827239 seconds (4.52 M allocations: 277.657 MiB, 5.24% gc time, 94.61% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(rand(Int, 1, 10^5), :auto); @time show(x)"
1.878271 seconds (5.02 M allocations: 304.266 MiB, 5.36% gc time, 91.61% compilation time)

Test 4

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(rand([1,missing], 1, 10^5), :auto); @time show(x)"
1.770344 seconds (4.57 M allocations: 280.561 MiB, 5.32% gc time, 95.93% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(rand([1,missing], 1, 10^5), :auto); @time show(x)"
1×100000 DataFrame
105.916926 seconds (24.64 M allocations: 5.580 GiB, 0.57% gc time, 1.72% compilation time)

Test 5

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(a=1,b=true,c=1.0,d='1',e="1",f=big(1)); @time show(x)"
1.958219 seconds (5.04 M allocations: 293.045 MiB, 4.36% gc time, 99.90% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(a=1,b=true,c=1.0,d='1',e="1",f=big(1)); @time show(x)"
1.952368 seconds (4.96 M allocations: 291.137 MiB, 4.86% gc time, 99.91% compilation time)

bkamins · 2021-05-07T07:11:28Z

@nalimilan - I have additionally moved getmaxwidths to abstractdataframe/io.jl, as that is the only place where it is used (so when we switch to supporting LaTeX/HTML via PrettyTables.jl we do not forget to remove it).

nalimilan · 2021-05-07T07:28:35Z

src/abstractdataframe/show.jl

@@ -68,7 +68,27 @@ if VERSION < v"1.5.0-DEV.261" || VERSION < v"1.5.0-DEV.266"
    end
 end

-"""Return compact string representation of type T"""
+function batch_compacttype(types::Vector{Any}, maxwidths::AbstractVector{Int},


Choose between Vector and AbstractVector? :-) Same below.

I opted to use Vector everywhere to signal we want to avoid specialization for different types.
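
As a small illustration of that choice: Julia compiles a separate specialization of a method for each concrete argument type it is called with, so an `AbstractVector{Int}` signature can end up compiled several times while a `Vector{Int}` signature accepts exactly one type. The function names below are hypothetical and are not part of the PR.

```julia
# Hypothetical functions, only to contrast the two signatures.
width_total_abstract(w::AbstractVector{Int}) = sum(w)
width_total_concrete(w::Vector{Int}) = sum(w)

# Two different concrete argument types, so two specializations get compiled.
a1 = width_total_abstract([1, 2, 3])   # Vector{Int}
a2 = width_total_abstract(1:3)         # UnitRange{Int}

# The concrete signature accepts only Vector{Int}; other callers must convert.
c1 = width_total_concrete(collect(1:3))
accepts_range = applicable(width_total_concrete, 1:3)   # false: no matching method
```

The concrete signature trades flexibility for a single compiled instance, which matters mostly for latency in display code like this.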

nalimilan · 2021-05-07T07:29:39Z

src/abstractdataframe/show.jl

@@ -68,7 +68,27 @@ if VERSION < v"1.5.0-DEV.261" || VERSION < v"1.5.0-DEV.266"
    end
 end

-"""Return compact string representation of type T"""


Add a comment to explain what's the point of having this function?

I don't see it. :-D

Ah, added. I was looking at another view on GitHub and thought you wanted me to expand the docstring of compacteltype (which I did).

bkamins · 2021-05-07T07:51:10Z

I have also removed the initial argument from the functions, as it is no longer used (such things are hard to track because they are not caught by coverage tests).

nalimilan · 2021-05-07T08:18:07Z

Do you know if the coverage misses are real? That's not the fault of this PR, but could be worth checking.

bkamins · 2021-05-07T09:58:14Z

Thank you!

use global pool of column eltype names

6685c73

bkamins requested a review from ronisbr May 5, 2021 20:54

ronisbr approved these changes May 6, 2021

nalimilan reviewed May 6, 2021

Revert "use global pool of column eltype names"

6dd0751

This reverts commit 6685c73.

bkamins changed the title ~~use global pool of column eltype names~~ use dict to cache eltype names May 6, 2021

do not use global state

3399b6d

bkamins added 2 commits May 6, 2021 19:21

Update src/abstractdataframe/show.jl

3234191

move getmaxwidths to abstractdataframe/io.jl

4a45826

bkamins mentioned this pull request May 7, 2021

fix performance issue in multirow split-apply-combine #2749

Merged

nalimilan reviewed May 7, 2021

changes after code review

f45509b

add explanation of batch_compacteltype

6152616

nalimilan approved these changes May 7, 2021

fix unnecessary line

513bca0

bkamins merged commit 818cb11 into main May 7, 2021

bkamins deleted the fix_compacttype branch May 7, 2021 09:58

bkamins mentioned this pull request May 7, 2021

Cover corner case of compactype (wide name and CategoricalValue) #2751

Merged
