Precompilation of DataFrames #1502
Good idea. We should also do this for dependencies, in particular CategoricalArrays. A reasonable way of doing that would be to run the full test suite under SnoopCompile.jl.
Honestly I haven't found using …
Given how big the DataFrames API is, it doesn't really sound possible to call all functions from …
+1 for …
While discussing the h2oai benchmarks, it appeared that it would be useful to provide a function or script to compile functions for all common types. That would allow reporting times net of the initial compilation cost, which is generally only paid once, and in particular would allow separating it from the cost of specializing some functions on the particular operation at hand. See discussion at h2oai/db-benchmark#69. I guess an easy way to do that would be to copy the output of SnoopCompile to an unexported function, which could be called by the h2oai benchmark, and possibly by users who want to put that in their …
I think it would be worth doing at least to understand what should be compiled and how long this compilation takes. Then I guess people who want to build a custom sysimg instead of using …
Right. Let's try using SnoopCompile while running the full test suite.
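A sketch of what that could look like with SnoopCompile's `@snoopi` workflow (the file paths below are illustrative, and this assumes a SnoopCompile version that provides `@snoopi`):

```julia
# Sketch: record inference while the test suite runs and turn the result
# into per-package precompile files. Paths are illustrative.
using SnoopCompile

# Collect (inference time, MethodInstance) pairs for everything inferred
# while the tests execute.
inf_timings = @snoopi include(joinpath("test", "runtests.jl"))

# Group the results by the package that owns each method and write
# precompile_<Package>.jl files (one for DataFrames, one for
# CategoricalArrays, and so on).
pc = SnoopCompile.parcel(inf_timings)
SnoopCompile.write("precompile", pc)
```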
@KristofferC Is this still true for Julia 1.4? Here is a comparison (I am annotating …).

Without precompile: …

With precompile: …

So using precompile increases load time by ~0.4 sec but decreases time to first result by ~1.4 sec, so in total 1 second is saved.
It might have changed or I didn't have a correct assessment earlier. I know Plots.jl has added some and it seems to help there. So if the numbers say it helps, I say go for it.
@nalimilan - so I will put out all new …
@nalimilan - I have checked what we have now. We could handle most of the compilation latency after the package is loaded if we included the following in the init file: …

It would make the package startup time grow from 0.9 sec to 5.5 sec. So for now this is certainly prohibitive, but maybe when we separate out DataFramesBase.jl it could actually be acceptable to have such code in DataFrames.jl? It is normal for other packages to load in 5-10 seconds (I do not want to say that this is something users like, of course). (We could alternatively do …)
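A minimal sketch of the kind of load-time warm-up being discussed (purely illustrative; this is not the actual snippet that produced the 0.9 sec vs 5.5 sec numbers above) could look like:

```julia
# Illustrative only: run a few representative operations when the module
# loads so that their compilation cost is paid at `using DataFrames` time.
function __init__()
    df = DataFrame(a=[1, 2, 1, 2], b=[1.0, 2.0, 3.0, 4.0], c=["x", "y", "x", "y"])
    describe(df)
    combine(groupby(df, :a), :b => sum)
    sort(df, :b)
    innerjoin(df, df, on=:a, makeunique=true)
end
```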
I really dislike packages running unnecessary code at startup. What if I only use the DataFrames functionality sporadically in my session? Why are you adding 5 extra seconds of completely avoidable latency for me? It can also introduce invalidations that cause other things to have to be recompiled. Unconditionally shifting an eventual compilation cost into package load time seems like a very bad trade-off. Ref: JuliaData/CSV.jl#325.
@KristofferC - sure. BTW: are you aware of the current state of the work towards better caching of compiled code between Julia sessions?
I agree it doesn't sound like a good trade-off to impose 5 seconds on every load to make things faster afterwards. And we never know what kind of functions users will call, so we'd have to do that for all functions... Would adding these to precompilation give some speedup? With SnoopCompile it should be easy to transform a script into a series of …
As discussed with @nalimilan - for now we can add an unexported … Also, this function can be used by users in combination with PackageCompiler.jl. @nalimilan - I am assigning you to this issue just to make sure it is kept track of, as this is an old issue. I hope that is OK with you.
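A sketch of such a function (the name `_precompile_calls` and its body are hypothetical, chosen here only for illustration):

```julia
# Hypothetical unexported warm-up function; a benchmark script (or a user's
# startup.jl) could call DataFrames._precompile_calls() once so that later
# timings exclude this compilation cost.
function _precompile_calls()
    df = DataFrame(id=repeat(1:3, 2), v=rand(6), s=repeat(["a", "b"], 3))
    combine(groupby(df, :id), :v => sum, :v => maximum)
    leftjoin(df, unique(df, :id), on=:id, makeunique=true)
    sort(df, [:id, :v])
    return nothing
end
```

The same call could be placed in a PackageCompiler precompile-execution script, so its compiled code ends up in a custom system image (see the sysimage sketch further below).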
For reference, taking the radical approach of precompiling everything that is used by the tests (#2456, which takes more than 3 minutes!), here are the timings I get in a fresh session:

julia> @time using DataFrames
  3.241466 seconds (2.29 M allocations: 187.625 MiB, 1.72% gc time)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  3.899736 seconds (4.45 M allocations: 236.128 MiB, 8.39% gc time)

As opposed to current master:

julia> @time using DataFrames
  0.981128 seconds (1.23 M allocations: 81.321 MiB)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  6.365692 seconds (16.59 M allocations: 850.439 MiB, 7.69% gc time)

So it looks like precompilation can make a difference, but at the cost of increasing the load time. Though clearly we don't want to precompile so many things. Maybe we could identify a few priorities which would give a reasonable tradeoff.
Thank you for working on it. Clearly there is too much in the precompilation (e.g. there are many very specific precompile statements that relate to concrete column types, like …). Also there is an issue of maintenance of …
If we keep track of the code needed to generate the precompile statements, we can run it for each new major release without too much work. It's not the end of the world if it's not completely up to date though. One way to compile only the major functions is to use …, which gives:

julia> @time using DataFrames
  1.832080 seconds (1.73 M allocations: 126.133 MiB)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  3.611778 seconds (5.33 M allocations: 281.904 MiB, 7.17% gc time)

We could use a higher threshold -- but with …
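If the threshold here refers to `@snoopi`'s `tmin` cutoff (an assumption on my part), raising it keeps only the method instances that were most expensive to infer:

```julia
# Assumption: the "threshold" is @snoopi's tmin keyword. Here only method
# instances whose inference took at least 50 ms end up in the output.
using SnoopCompile
inf_timings = @snoopi tmin=0.05 include(joinpath("test", "runtests.jl"))
SnoopCompile.write("precompile", SnoopCompile.parcel(inf_timings))
```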
Actually, for the H2O benchmark I think it would even be sensible to build a custom Julia image (I guess this is OK with the H2O rules, as this is what a user would normally do).
That would be more complex for a limited gain, wouldn't it?
It would, but it would "kind of" show the real recommended deployment pattern.
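A sketch of that deployment pattern with PackageCompiler.jl (the file names are illustrative, and `warmup.jl` would contain the representative operations to bake in):

```julia
# Build a custom system image containing DataFrames plus the compiled code
# for whatever warmup.jl exercises. File names are illustrative.
using PackageCompiler

create_sysimage([:DataFrames];
                sysimage_path="sys_dataframes.so",
                precompile_execution_file="warmup.jl")
```

The benchmark would then be started with `julia --sysimage sys_dataframes.so`, so the reported times exclude compilation of the warmed-up methods.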
Often it isn't too bad to write the statements manually. Having a few well-targeted statements usually gets you quite a long way.
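For illustration, hand-written statements targeting a couple of hot paths might look like this (the signatures are my guesses at common call patterns, not output from an actual snooping run):

```julia
# Hand-picked precompile directives; the signatures below are illustrative
# guesses at common call patterns, not taken from a real SnoopCompile run.
precompile(Tuple{typeof(DataFrames.groupby), DataFrames.DataFrame, Symbol})
precompile(Tuple{typeof(DataFrames.combine),
                 DataFrames.GroupedDataFrame{DataFrames.DataFrame},
                 Pair{Symbol, typeof(sum)}})
```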
Fixed by #2456.
I think it would be good to add some of the most common uses of DataFrames.jl functions to the module to force their precompilation, as this will improve the user experience (and DataFrames.jl is probably one of the packages that new users start with). Any opinions on whether we should do it and how? (In the worst case I could manually list the cases I think are relevant and code them by hand.)