improve precompilation coverage #3285

Merged: 9 commits merged into main on Feb 11, 2023

Conversation

@bkamins (Member) commented Feb 5, 2023

Fixes #3248

To do:

  • select precompilation statements
  • decide what to do with InlineStrings.jl and SentinelArrays.jl

I have now implemented step 1 (selecting precompilation statements).

Here are some statistics:

Julia 1.9 main branch (old precompilation)

  • precompilation time: 36.419711 seconds
  • DataFrames.jl load time later: 1.408057 seconds
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 5.860520 seconds

Julia 1.9 this PR (new precompilation)

  • precompilation time: 45.814128 seconds
  • DataFrames.jl load time later: 1.587016 seconds
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 0.394517 seconds

Julia 1.8.5 this PR (new precompilation)

  • precompilation time: 21.730528 seconds
  • DataFrames.jl load time later: 2.356902 seconds
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 13.682346 seconds

In general, my recommendation is to use the long list of precompilation statements. It adds about 9 seconds to precompilation and about 0.2 seconds to load time (hopefully users will accept this; the only potentially problematic place is Pluto.jl, so let us discuss it). The benefit is that we precompile all commonly used functions.
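
For reference, a precompile workload of this kind is typically expressed with SnoopPrecompile (DataFrames.jl already depends on it, as the @time_imports output later in this thread shows). Below is a minimal sketch; the operations are illustrative stand-ins, not the actual statement list selected in this PR:

using DataFrames, SnoopPrecompile

@precompile_setup begin
    # small illustrative data; the real workload covers far more entry points
    df = DataFrame(a = repeat(1:5, 4), b = rand(20), c = string.(1:20))
    @precompile_all_calls begin
        describe(df)
        combine(groupby(df, :a), :b => sum, :c => first)
        innerjoin(df, df[:, [:a, :b]], on = :a, makeunique = true)
        sort(df, :b)
        subset(df, :b => ByRow(>(0.5)))
    end
end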

Decide what to do with InlineStrings.jl and SentinelArrays.jl

After we settle step 1, we need to decide what to do with InlineStrings.jl and SentinelArrays.jl. I will benchmark this later (after we decide which precompile statements to keep). In general we have three options:

  • do not add them
  • add them
  • add CSV.jl as a hidden dependency (this way, when CSV.jl changes its dependencies, we will track them automatically). I could then also add a simple precompilation statement that loads a CSV file into a DataFrame, so the user experience of reading CSV files into a DataFrame should also improve.

@nalimilan, @quinnj, @timholy - do you have any opinion? Thank you!

@bkamins added the "ecosystem" label (Issues in DataFrames.jl ecosystem) on Feb 5, 2023
@bkamins added this to the 1.5 milestone on Feb 5, 2023

@bkamins (Member Author) commented Feb 5, 2023

Also, maybe CSV.jl should be handled as an extension? (We would then just need to ensure that we precompile things in a way that avoids invalidations.) If you have experience with what works best here, please comment. Thank you!
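
For context, a minimal sketch of how such an extension could be wired up under Julia >= 1.9; the names, the UUID placeholder, and the workload below are hypothetical and not what this PR ends up doing:

# DataFrames.jl's Project.toml would gain (UUID placeholder, not filled in here):
#
#   [weakdeps]
#   CSV = "<CSV.jl UUID>"
#
#   [extensions]
#   DataFramesCSVExt = "CSV"
#
# ext/DataFramesCSVExt.jl, loaded automatically only when CSV.jl is also in the session:

module DataFramesCSVExt

using DataFrames, CSV, SnoopPrecompile

@precompile_all_calls begin
    path, io = mktemp()
    write(io, "a,b\n1,x\n2,y\n")
    close(io)
    CSV.read(path, DataFrame)    # warm the CSV-file-to-DataFrame path
end

end # module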

@bkamins (Member Author) commented Feb 5, 2023

As a comment, we probably do indeed need to fix these invalidations. Here is what I get when both CSV.jl and DataFrames.jl are loaded in the "Julia 1.9 this PR (new precompilation)" scenario:

julia> @time using CSV
  0.504498 seconds (849.83 k allocations: 54.461 MiB, 3.65% gc time, 2.13% compilation time)

julia> @time using DataFrames
  1.858852 seconds (2.67 M allocations: 169.493 MiB, 3.31% gc time, 34.05% compilation time: 100% of which was recompilation)

and then running the operations in the precompilation part takes 4.684947 seconds (while without CSV.jl it takes 0.394517 seconds, so indeed we lose almost all benefits of precompilation)
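
A sketch of one way to surface such invalidations, assuming SnoopCompileCore.jl and SnoopCompile.jl are installed (run in a fresh session):

using DataFrames                     # load the precompiled package first
using SnoopCompileCore
invalidations = @snoopr using CSV    # record what loading CSV.jl invalidates
using SnoopCompile                   # analysis tools, loaded only after recording
trees = invalidation_trees(invalidations)
show(trees[end])                     # the method responsible for the most invalidations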

@timholy (Contributor) commented Feb 5, 2023

Nice!

Does it fix most of that recompile time if you depend on InlineStrings & SentinelArrays?

@bkamins (Member Author) commented Feb 6, 2023

If I add InlineStrings.jl and SentinelArrays.jl to the dependencies AND load them (i.e., only having them in the dependencies is not enough), then the only time that changes is running the test code (everything else is comparable), and it is:

0.862469 seconds (1.32 M allocations: 74.504 MiB, 2.16% gc time, 97.84% compilation time: 54% of which was recompilation)

So there is recompilation but much less.

If I add CSV.jl as a dependency instead then:

  • precompilation time goes up to 52.912350 seconds (not that bad)
  • then load time of DataFrames.jl goes up to 2.225259 seconds (a bit more but not prohibitive)
  • time to run the benchmark without loading CSV.jl: 0.360435 seconds (good)
  • and the final timings:
julia> @time using CSV
  0.752680 seconds (850.07 k allocations: 54.492 MiB, 2.51% gc time, 1.25% compilation time)

julia> @time using DataFrames
  2.002735 seconds (3.01 M allocations: 189.720 MiB, 3.83% gc time, 30.74% compilation time: 100% of which was recompilation)

julia> @time # running all the benchmark codes
  0.363915 seconds (105.03 k allocations: 5.971 MiB, 95.26% compilation time)

julia> @time CSV.read("test.csv", DataFrame) # and this is something that is really nice - a big bonus of fast first time to read CSV as DataFrame
  0.059060 seconds (26.87 k allocations: 1.770 MiB, 97.05% compilation time)

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

In summary: it looks like adding CSV.jl as a dependency would be the best option. The question is whether it is worth making it a conditional dependency (probably yes, but I have not benchmarked that).

Also @quinnj - CSV.jl is now at version 0.10.9. What are the plans for further development/versions of CSV.jl? (The issue is what compat bounds to put into Project.toml if we decide to go forward with adding CSV.jl as a dependency.)

@bkamins (Member Author) commented Feb 6, 2023

I have pushed the version with CSV.jl as a dependency (a simple version, no conditional loading) in case someone is interested in testing this.

@bkamins (Member Author) commented Feb 6, 2023

Julia complains that the following method definitions are ambiguous:

reduce(::typeof(vcat), dfs::Union{Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}, AbstractVector{<:AbstractDataFrame}}; cols, source)
reduce(op::OP, x::SentinelArrays.ChainedVector) where OP

I will fix this once we decide what to include as dependencies.

EDIT: fixed
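
For readers unfamiliar with this failure mode, here is a toy reproduction of the pattern; the types below are made up, not the actual DataFrames/SentinelArrays definitions:

abstract type AbstractDF end
struct DF <: AbstractDF end

struct ChainedVec{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::ChainedVec) = size(v.data)
Base.getindex(v::ChainedVec, i::Int) = v.data[i]

# One method specializes on the function argument, the other on the container:
f(::typeof(vcat), xs::AbstractVector{<:AbstractDF}) = "dataframe method"
f(op, xs::ChainedVec) = "chained-vector method"

# f(vcat, ChainedVec([DF()])) would be ambiguous: neither method is more specific.
# Adding a method that is at least as specific in *both* arguments resolves it:
f(::typeof(vcat), xs::ChainedVec{<:AbstractDF}) = "disambiguating method"

f(vcat, ChainedVec([DF()]))   # now dispatches unambiguously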

@timholy (Contributor) commented Feb 6, 2023

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

Do you get recompilation if you use --startup-file=no? I see there are several sources of Revise invalidation (I keep finding those...); I will try to fix them.

@bkamins (Member Author) commented Feb 6, 2023

Everything above is without Revise.jl and with --startup-file=no.

@timholy (Contributor) commented Feb 6, 2023

Fixes for the Revise stack:

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

Base.require invalidation 😢 :

[screenshot of the invalidation tree omitted]

Packages that define new AbstractString subtypes are tricky!

@timholy (Contributor) commented Feb 6, 2023

JuliaLang/julia#48557

@quinnj (Member) commented Feb 6, 2023

Also @quinnj - CSV.jl is now on 0.10.9 version. What are the plans for further development/versions of CSV.jl? (the issue is what compat bounds to put into Project.toml if we decide to go forward with adding CSV.jl as a dependency)

Yeah, I've been a little tied up w/ other projects at the moment, so haven't had a lot of time for CSV.jl lately. @Drvi, @nickrobinson251, and I have prototyped a new internal refactoring that currently lives here, which optimizes memory/perf for the chunked/row streaming case, and I want to adapt it to work for the CSV.File case as well. It should resolve the multithreading corner cases we continue to see pop up and be a better long-term solution for overall memory use as well. We just need to find the time to do the work to get it upstreamed to CSV.jl. So roughly my plan is we will probably have a 0.10.10 and maybe 0.10.11 release w/ some bugfixes and such, but 1.0 will be once we can upstream our new streaming work. I'm hopeful we can do that by the end of this year.

I also really appreciate the investigative efforts here by @bkamins and @timholy; I'm more than happy to make any changes necessary in InlineStrings.jl, SentinelArrays.jl, CSV.jl, WeakRefStrings.jl or wherever else if it means a better story for DataFrames.jl!

@timholy (Contributor) commented Feb 6, 2023

With JuliaLang/julia#48557 I can verify (on a different machine)

julia> @time using CSV
  0.713006 seconds (759.16 k allocations: 47.936 MiB, 9.82% gc time, 1.54% compilation time)

julia> @time using DataFrames
  2.397113 seconds (2.93 M allocations: 164.404 MiB, 5.63% gc time)

Fixes the recompilation during load.

@bkamins (Member Author) commented Feb 6, 2023

@timholy - I am not sure whether this was discussed elsewhere, but maybe the way to go would be to define ENV["JULIA_PACKAGE_PRECOMPILE"]: if it is set to "no", skip the precompilation statements; otherwise perform precompilation. This would allow e.g. Pluto.jl to disable precompilation when it is not desirable.

If this general solution is not something you would find useful, maybe we could add ENV["JULIA_DATAFRAMES_PRECOMPILE"], which would have the same effect but be limited to DataFrames.jl precompilation?

CC @KristofferC
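
A minimal sketch of the ENV-based opt-out proposed above (it is superseded by the Preferences-based mechanism described in the next comment); the variable name follows the suggestion above and the workload is a stand-in, not DataFrames code:

# inside the package's precompile file
using SnoopPrecompile

if get(ENV, "JULIA_DATAFRAMES_PRECOMPILE", "") != "no"
    @precompile_all_calls begin
        nt = (a = [1, 2, 3], b = ["x", "y", "z"])   # stand-in for a real workload
        sum(nt.a)
        join(nt.b, ",")
    end
end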

@timholy (Contributor) commented Feb 6, 2023

Do you know about the last section of the SnoopPrecompile docs?

using SnoopPrecompile, Preferences
set_preferences!(SnoopPrecompile, "skip_precompile" => ["PackageA", "PackageB"])

That's strongly encouraged over the ENV solution, since the ENV solution can leave you with an inconsistent cache (there is no record of what the ENV settings were when a given package was precompiled). A warning, though: I may change how this works to make the settings more "granular". Stay vaguely tuned over the next month or so.
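
Applied to this PR's situation, a downstream project (e.g. a notebook environment) could opt out like this; "DataFrames" in the list is just an example entry, not a recommendation from the thread:

using SnoopPrecompile, Preferences
set_preferences!(SnoopPrecompile, "skip_precompile" => ["DataFrames"])
# The choice is written to the active project's LocalPreferences.toml, so it is
# recorded alongside the precompile cache (unlike an ENV-variable toggle).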

This could allow e.g. Pluto.jl to disable precompilation if it is not desirable.

My impression is that @fonsp is planning to implement (or has implemented) utilities to sync the manifests of many different notebooks to a single "master" environment. (It's "just" a matter of copying the version info from one Manifest into the corresponding slot in a second Manifest.) I hope that should at least hold us over until the exciting work on parallel LLVM compilation lands.

@bkamins (Member Author) commented Feb 6, 2023

OK - thank you for the explanation.

@timholy (Contributor) left a review: LGTM, see very small comments.

@bkamins (Member Author) commented Feb 7, 2023

After some more thinking and testing, I buy the argument that CSV.jl is too heavy a dependency for DataFrames.jl. However, SentinelArrays.jl and InlineStrings.jl seem relatively lightweight, as we can see here:

julia> @time_imports using DataFrames
      0.8 ms  Statistics
      0.3 ms  Reexport
      0.2 ms  Compat
      6.1 ms  OrderedCollections
     59.4 ms  DataStructures
      0.5 ms  SortingAlgorithms
      0.8 ms  DataAPI
     15.7 ms  PooledArrays
      7.6 ms  Missings
      2.4 ms  InvertedIndices
      0.3 ms  IteratorInterfaceExtensions
      0.2 ms  TableTraits
      0.9 ms  Formatting
      0.3 ms  DataValueInterfaces
     13.9 ms  Tables
    335.9 ms  StringManipulation
     71.1 ms  Crayons
      0.8 ms  LaTeXStrings
    174.1 ms  PrettyTables
     12.2 ms  Preferences
      0.3 ms  SnoopPrecompile
     46.0 ms  SentinelArrays
     62.9 ms  Parsers
      6.4 ms  InlineStrings
   1004.4 ms  DataFrames

(they add 46 ms and 6.4 ms respectively, which I think is acceptable)

Now a comparison of the timing of a normal DataFrames.jl load is as follows:

If we depend on CSV.jl

julia> @time using DataFrames
  2.256193 seconds (3.49 M allocations: 217.968 MiB, 4.23% gc time, 3.26% compilation time)

If we depend on SentinelArrays.jl and InlineStrings.jl

julia> @time using DataFrames
  1.910445 seconds (2.64 M allocations: 167.254 MiB, 3.77% gc time, 4.10% compilation time)

(the time would be similar if we did not depend on SentinelArrays.jl and InlineStrings.jl)

The benefit of having SentinelArrays.jl and InlineStrings.jl as dependencies is that if someone uses them (directly or indirectly), we do not invalidate precompiled DataFrames.jl code (in practice this means CSV.jl users, but maybe these packages will be used more widely in the future); see the sketch after this list. So:

  • CSV.jl users get some benefit (not 100%, but a lot)
  • non-CSV.jl users get a speedup
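
A sketch of that idea (not the PR's actual precompile file): with the two packages as dependencies, the precompile workload can exercise the concrete column types CSV.jl typically produces, so the compiled code stays valid when CSV.jl is loaded later.

using DataFrames, InlineStrings, SentinelArrays, SnoopPrecompile

@precompile_all_calls begin
    df = DataFrame(a = ChainedVector([[1, 2], [3]]),
                   b = [String7("x"), String7("yy"), String7("zzz")])
    combine(groupby(df, :b), :a => sum)
    sort(df, :a)
end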

@nalimilan (Member) left a review:

It's too bad that we have to add dependencies just for precompilation, but it's probably worth it as a temporary measure. Ideally, at some point Julia will be able to make such dependencies conditional on the packages being installed in the environment.

@timholy (Contributor) commented Feb 10, 2023

add dependencies just for precompilation

I suspect it's headed to "pseudo-stdlib" status. I plan to move SnoopPrecompile out to JuliaLang sometime soon; I'm dragging my feet mostly because I wonder if we should rename it precisely to avoid conflating it with SnoopCompile. (They use similar techniques and thus are parallel in my mind, but they are also quite different.)

SnoopCompile is big with lots of dependencies, but SnoopPrecompile is tiny: https://github.com/timholy/SnoopCompile.jl/blob/master/SnoopPrecompile/src/SnoopPrecompile.jl is the entire package (and it's about 40% docstring).

@quinnj (Member) commented Feb 10, 2023

I think @nalimilan's concern is having to add CSV/InlineStrings/SentinelArrays for precompilation, not SnoopPrecompile, which as you point out is lightweight.

@timholy (Contributor) commented Feb 10, 2023

Gotcha. Keep in mind that adding them is an efficient way of avoiding having your code invalidated, but it's not the only solution. The other main approach is to identify the inference failures in DataFrames.jl that are causing Julia to be uncertain about which methods will be dispatched and then fix those inference failures. That said, I'm fully on board with this being an expedient and very effective solution that will make things better for your users.

I'm painfully aware that SnoopCompile + ascend + Cthulhu is a big stack of code to learn, and reading Julia's type-inferred CodeInfos is a bit like a two-language problem. I just started working on JuliaDebug/Cthulhu.jl#345 because I think it's long overdue that Julia have an easy way for relative newbies to identify and fix type instability in their code. That should help fix invalidations and save lots of hours on Discourse helping people resolve "why is Julia slower than LanguageX?" questions.
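
A toy illustration of the workflow described above, assuming Cthulhu.jl is installed; unstable below is a made-up function, not DataFrames code:

using Cthulhu

unstable(x) = x > 0 ? 1 : "negative"   # inferred return type is Union{Int, String}

@descend unstable(1)                    # interactively inspect the inferred code
# Without Cthulhu, a quick static check of the same thing:
# @code_warntype unstable(1)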

@bkamins (Member Author) commented Feb 11, 2023

identify the inference failures in DataFrames.jl that are causing Julia to be uncertain about which methods will be dispatched and then fix those inference failures.

I wanted to confirm one thing here. Since DataFrame is type-unstable on purpose, at some point we have to live with this "dispatch uncertainty" (at the point where we move from type-unstable to type-stable code), and it is unavoidable. Do I understand this correctly?

@bkamins merged commit 1b9fa19 into main on Feb 11, 2023
@bkamins deleted the bk/precompilation branch on Feb 11, 2023 at 09:42

@bkamins (Member Author) commented Feb 11, 2023

Thank you! (I would love to continue the discussion to get a better understanding of what can be done.)

@timholy (Contributor) commented Feb 13, 2023

Yes, there are places where deliberate non-specialization is difficult to reconcile with resistance to invalidations. In such cases, setting Base.Experimental.@max_methods 1 is probably your best hope. If that causes performance problems, one way to refine the strategy might be to split the code that can tolerate it into a separate submodule and set @max_methods only on that submodule.
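
A sketch of that submodule idea; the module and function below are hypothetical, not part of DataFrames.jl:

module LooselyTypedHelpers

# Infer against at most one matching method per call site, which makes the code
# in this submodule much more resistant to invalidation.
Base.Experimental.@max_methods 1

# deliberately untyped helper: works on any column-like iterable
total_textwidth(col) = sum(length(string(x)) for x in col)

end # module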

Labels: ecosystem (Issues in DataFrames.jl ecosystem)
Linked issue: Invalidations when loading CSV
4 participants