use dict to cache eltype names #2750

Merged
merged 8 commits into from
May 7, 2021

bkamins
Member

@bkamins bkamins commented May 5, 2021

Fixes #2739

This change should make show faster in most cases (even narrow tables should improve). I have not done much performance testing, so additional checks are welcome.

One benchmark:

This PR:

julia> df = DataFrame(rand(1,10^5),:auto);

julia> @time show(df)
2.086479 seconds (5.54 M allocations: 335.747 MiB, 7.02% gc time, 94.93% compilation time)

julia> @time show(df)
0.106020 seconds (501.91 k allocations: 40.995 MiB, 8.64% gc time)

julia> allowmissing!(df);

julia> @time show(df)
0.206911 seconds (622.98 k allocations: 48.133 MiB, 5.08% gc time, 52.31% compilation time)

julia> @time show(df)
0.099357 seconds (501.88 k allocations: 40.994 MiB, 9.74% gc time)

and DataFrames.jl 1.1:

julia> df = DataFrame(rand(1,10^5),:auto);

julia> @time show(df)
1.994571 seconds (5.40 M allocations: 325.074 MiB, 7.43% gc time, 91.25% compilation time)

julia> @time show(df)
0.168648 seconds (801.91 k allocations: 56.254 MiB, 11.60% gc time)

julia> allowmissing!(df);

julia> @time show(df)
108.357185 seconds (20.45 M allocations: 5.339 GiB, 0.45% gc time, 0.05% compilation time)

julia> @time show(df)
112.724588 seconds (20.40 M allocations: 5.336 GiB, 0.34% gc time)

@bkamins bkamins requested a review from ronisbr May 5, 2021 20:54
@ronisbr
Member

ronisbr commented May 6, 2021

LGTM! Just one question: why can't we move the lock and unlock inside the function compacttype, to avoid adding those calls everywhere compacttype is called?

@bkamins
Member Author

bkamins commented May 6, 2021

why can't we move the lock and unlock inside the function compacttype, to avoid adding those calls everywhere compacttype is called?

The point is that locking is not done every time compacttype is called. The lock is taken only once per chain of compacttype calls.
So if you have 10^5 columns you avoid many lock and unlock calls, and lock/unlock is more expensive than a dict lookup.
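
A minimal sketch of the locking pattern described above, not the PR's actual code: a global cache is guarded by a lock that is taken once around a whole batch of lookups rather than inside each lookup. The names CACHE, CACHE_LOCK, slow_name, cached_name and names_locked_once are hypothetical and only serve the illustration.

```julia
# Hypothetical global cache and lock (illustrative names, not from DataFrames.jl)
const CACHE = Dict{Any, String}()
const CACHE_LOCK = ReentrantLock()

# Stand-in for the expensive computation of a type's display name
slow_name(T::Type) = string(T)

# Plain dict lookup; the caller is expected to hold CACHE_LOCK already
cached_name(T::Type) = get!(() -> slow_name(T), CACHE, T)

# Take the lock once for the whole chain of lookups,
# instead of locking and unlocking inside every cached_name call
function names_locked_once(types::Vector{Any})
    lock(CACHE_LOCK) do
        String[cached_name(T) for T in types]
    end
end

types = Any[Float64 for _ in 1:10^5]
labels = names_locked_once(types)
```

With 10^5 columns this performs one lock/unlock pair and 10^5 dict lookups, whereas locking inside cached_name would perform 10^5 lock/unlock pairs.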

@nalimilan - I will wait for your approval before merging.

@ronisbr
Member

ronisbr commented May 6, 2021

Thanks, @bkamins. You are right. I did not realize that those broadcasts are actually calling the function many, many times.

@nalimilan
Member

Instead of taking the lock everywhere, wouldn't it be almost as fast (and simpler) to use a vectorized compacttype which allocates a temporary dict each time it's called? I imagine that the number of different types is small so it shouldn't cost much. That requires storing the result of compacttype in a vector with one entry per column, but that shouldn't be too large, right?
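
A rough sketch of the suggestion above, under stated assumptions: a vectorized function allocates a temporary dict on each call and returns one string per column, so no global state or lock is needed. The names compact_name and batch_compact_names are hypothetical, and the truncation rule is a simple stand-in rather than the formatting logic DataFrames.jl uses.

```julia
# Hypothetical stand-in for the per-type formatting:
# truncate the type name to at most maxwidth characters
function compact_name(T::Type, maxwidth::Int)
    s = string(T)
    length(s) <= maxwidth && return s
    return first(s, max(maxwidth - 1, 0)) * "…"
end

# Vectorized version: one temporary cache per call, no global state, no lock
function batch_compact_names(types::Vector{Any}, maxwidths::Vector{Int})
    cache = Dict{Any, String}()  # keyed by (type, width); discarded on return
    return String[get!(() -> compact_name(T, w), cache, (T, w))
                  for (T, w) in zip(types, maxwidths)]
end

types = Any[Int, Union{Missing, Int}, Int]
result = batch_compact_names(types, [8, 8, 8])
```

When the number of distinct column types is small, the temporary dict stays small even for very wide tables, while the result vector holds one entry per column.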

@bkamins bkamins changed the title from "use global pool of column eltype names" to "use dict to cache eltype names" on May 6, 2021
@bkamins
Member Author

bkamins commented May 6, 2021

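@nalimilan I have removed the global state as you requested and improved inference a bit, so time to first table is shorter even in simple cases, although the logic is more complex. Because of these changes I have resolved your comments, as they no longer apply. In each test below, the first run (--project) uses this PR's branch and the second uses the installed DataFrames.jl release: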

Test 1

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(a=1); @time show(x)"
1.627265 seconds (4.10 M allocations: 240.722 MiB, 3.88% gc time, 99.93% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(a=1); @time show(x)"
1.695662 seconds (4.30 M allocations: 253.672 MiB, 4.92% gc time, 99.94% compilation time)

Test 2

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(); @time show(x)"
0×0 DataFrame  1.403435 seconds (3.59 M allocations: 211.961 MiB, 5.91% gc time, 99.97% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(); @time show(x)"
0×0 DataFrame  1.574054 seconds (4.13 M allocations: 243.455 MiB, 6.07% gc time, 99.98% compilation time)

Test 3

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(rand(Int, 1, 10^5), :auto); @time show(x)"
1.827239 seconds (4.52 M allocations: 277.657 MiB, 5.24% gc time, 94.61% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(rand(Int, 1, 10^5), :auto); @time show(x)"
1.878271 seconds (5.02 M allocations: 304.266 MiB, 5.36% gc time, 91.61% compilation time)

Test 4

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(rand([1,missing], 1, 10^5), :auto); @time show(x)"
1.770344 seconds (4.57 M allocations: 280.561 MiB, 5.32% gc time, 95.93% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(rand([1,missing], 1, 10^5), :auto); @time show(x)"
1×100000 DataFrame
105.916926 seconds (24.64 M allocations: 5.580 GiB, 0.57% gc time, 1.72% compilation time)

Test 5

~/Desktop/Dev/DF_dev$ julia --project -e "using DataFrames; x = DataFrame(a=1,b=true,c=1.0,d='1',e="1",f=big(1)); @time show(x)"
1.958219 seconds (5.04 M allocations: 293.045 MiB, 4.36% gc time, 99.90% compilation time)
~/Desktop/Dev/DF_dev$ julia -e "using DataFrames; x = DataFrame(a=1,b=true,c=1.0,d='1',e="1",f=big(1)); @time show(x)"
1.952368 seconds (4.96 M allocations: 291.137 MiB, 4.86% gc time, 99.91% compilation time)

@bkamins
Member Author

bkamins commented May 7, 2021

@nalimilan - I have additionally moved getmaxwidths to abstractdataframe/io.jl as it is the only place where it is used (so when we switch to support LaTeX/HTML via PrettyTables.jl we do not forget to remove it)

@@ -68,7 +68,27 @@ if VERSION < v"1.5.0-DEV.261" || VERSION < v"1.5.0-DEV.266"
end
end

"""Return compact string representation of type T"""
function batch_compacttype(types::Vector{Any}, maxwidths::AbstractVector{Int},
Member

Choose between Vector and AbstractVector? :-) Same below.

Member Author

@bkamins bkamins May 7, 2021

I opted to use Vector everywhere to signal we want to avoid specialization for different types.
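
One way to read this remark, sketched with hypothetical names: Julia compiles a separate specialization of a method for each concrete argument type it is called with, and Vector{Any} is a single concrete type, so a function that accepts only Vector{Any} is compiled once no matter which column types the vector holds. The function count_missing_allowed below is illustrative and not part of DataFrames.jl.

```julia
# Hypothetical function: the signature accepts exactly one concrete
# container type, so a single compiled specialization serves every table
count_missing_allowed(types::Vector{Any}) = count(T -> Missing <: T, types)

# Collect the element types of some example columns into a Vector{Any}
cols = ([1, 2], [1.0, missing], ["a", "b"])
types = Any[eltype(c) for c in cols]

n = count_missing_allowed(types)
```

The trade-off is that elements are accessed through dynamic dispatch, which is usually acceptable for display code that runs once per column.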

@@ -68,7 +68,27 @@ if VERSION < v"1.5.0-DEV.261" || VERSION < v"1.5.0-DEV.266"
end
end

"""Return compact string representation of type T"""
Member

Add a comment to explain what's the point of having this function?

Member Author

added

Member

I don't see it. :-D

Member Author

Ah, added. I was looking at another view on GitHub and thought you wanted me to expand the docstring of compacteltype (which I did).

@bkamins
Member Author

bkamins commented May 7, 2021

I have also removed the initial argument from the functions, as it is no longer used (such things are hard to track because coverage tests do not catch them).

@nalimilan
Member

Do you know if the coverage misses are real? That's not the fault of this PR, but could be worth checking.

@bkamins bkamins merged commit 818cb11 into main May 7, 2021
@bkamins bkamins deleted the fix_compacttype branch May 7, 2021 09:58
@bkamins
Member Author

bkamins commented May 7, 2021

Thank you!

Successfully merging this pull request may close these issues.

DataFrames with many columns are too slow (because of show())
3 participants