
precompilation #7977

Closed
StefanKarpinski opened this issue Aug 12, 2014 · 97 comments
Labels: performance (Must go faster)

@StefanKarpinski
Member

We already have precompilation of the Julia system image. There's been a lot of discussion of how to precompile packages; this issue is for discussing how to approach that.

StefanKarpinski modified the milestones: 0.4, 0.4-projects (Aug 12, 2014)
@StefanKarpinski
Member Author

There's some amount of processing of individual source files and packages that can be cached – parsing & lowering, and compressed AST generation. Some amount of type inference and code generation could also potentially be done independently, but that's much trickier. Full type inference, inlining, and code generation in general cannot be done until later. Currently, system image generation effectively takes a "snapshot" of the current process – what if we just make snapshotting really easy and make it possible to take a snapshot of a Julia process at any point and then start that snapshot again later? This gives a "layering" approach to precompilation and is quite easy to think about: get your REPL in a state you want to work from, snapshot it, and then start working from the snapshot instead of from an empty REPL.
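
To make that concrete, here is a purely hypothetical sketch of the workflow; neither Base.snapshot nor a --restore flag exists, the names are just placeholders for the idea:

using DataFrames, Gadfly         # get the REPL into the state you want to work from
Base.snapshot("session.img")     # hypothetical: dump the current process image to disk

# later, from the shell (also hypothetical):
#   julia --restore session.img  # resume from the snapshot instead of an empty REPL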

@JeffBezanson
Member

dup of #4373

@vtjnash
Member

vtjnash commented Aug 12, 2014

The main issue is what a snapshot would even mean. If I opened a file, or initialized a DLL, what does the new process see? If I start a task, presumably this is part of the state too, but what does it mean to snapshot a Task?

@timholy
Member

timholy commented Aug 12, 2014

Whole-system snapshots don't seem to be that different from our current userimg.jl approach, and that doesn't seem to be very popular. (I seem to be one of very few people reporting bugs with it...).

I think its main limitation is that anything I use heavily (i.e., in practice contributes a lot to my load times) also tends to be something I change frequently. I don't want to have to rebuild all of Julia to test a one-line fix. (Yes, you have to do that if you're modifying base Julia, but we all know that's not ideal, and it gets far worse if you're building Julia and precompiling a dozen packages.) So I find myself choosing between whether package A should be precompiled or remain flexible. In practice I tend to precompile one or two packages, tops.

We need a granular approach: package A can be precompiled as a standalone object, to be updated if A's source (or any macros in other packages it depends on) changes. I know it's not easy, but it's reality.
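
Something like the following rough sketch (not an existing API; the names are made up) captures the rule I have in mind:

# A package's cached object is stale if any of its own source files, or any source
# file of a package it depends on (whose macros may have changed), is newer than it.
function cache_is_stale(cachefile, pkg_srcfiles, dep_srcfiles)
    isfile(cachefile) || return true
    cachetime = mtime(cachefile)
    for f in [pkg_srcfiles; dep_srcfiles]
        mtime(f) > cachetime && return true
    end
    return false
end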

@johnmyleswhite
Member

+1 for dependency tracking to determine when compilation needs to happen. I had to stop using userimg.jl because I would forget to recompile Julia when working on packages like DataFrames.

@JeffBezanson
Member

Speeding up the compiler can also be part of this. I don't know what kind of gains are possible, but 2x seems like a conservative estimate. It's still not well understood why the system does so much work when loading packages. In theory, if a package contains only type and method definitions, loading it should take hardly any time, because compilation is mostly lazy. In practice, many packages run code at load time, but my impression is that they don't run enough stuff to require compiling a large percentage of the package.
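
A rough illustration of the laziness point (nothing package-specific; timings left to the reader):

@time eval(:(square(x) = x * x))   # defining a method: parse, lower, add to the method table
@time square(2.0)                  # first call: type inference and code generation happen here
@time square(3.0)                  # later calls: already compiled, essentially free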

@timholy
Member

timholy commented Aug 12, 2014

Agreed, especially given that profiling indicates that most of the time is spent in flisp.c. See #7280 (comment) and the gist posted below that.

@JeffBezanson
Member

commit 8973ed1 might change that feature of the profile, though it's clearly not enough.

@jakebolewski
Member

@timholy have you gone back and profiled again after 8973ed1?

@timholy
Member

timholy commented Aug 12, 2014

Nope, but if your machine happens to give good C backtraces it's really easy to test this yourself:

using DataFrames, ProfileView
@profile reload("DataFrames")  # can't have using inside @profile
ProfileView.view(C=true)

I just don't happen to be on such a machine at the moment.

(If Tk doesn't work for you, the IJulia interface is really nice.)

@timholy
Member

timholy commented Aug 12, 2014

It's all in flisp.c. More than 90% of the time.

I can't export the graph for you easily at the moment because of timholy/ProfileView.jl#16. But if the above statements work for you, you can easily get this (and far more) information directly yourself.

@vtjnash
Member

vtjnash commented Aug 12, 2014

Maybe you can parse using JuliaParser to see where the possible hot-spots are (since it shares a common structure with the flisp parser)?

@timholy
Member

timholy commented Aug 12, 2014

Actually, I could do a screen capture:
[screenshot: reloaddataframes]
[screenshot: reloaddataframes2]

Open the images in a separate tab and you'll see them at high resolution. The first shows me hovering over a brown bar, with flisp.c, fl_applyn: line 761. That bar takes up about 60% of the horizontal width, meaning it (and things that it calls, which represent the stack above it) accounts for 60% of the time. The second image shows me hovering over a green bar in the next big stack to the right, which is also the same spot in the same file (but arrived at via a different call chain). That accounts for most of the rest of the time.

So basically everything except the two narrow stacks on the right is flisp.c. And guess what: the first of those is flisp.c, too! The very last substantive stack is in gf.c, remove_conflicting: 1209.

julia> versioninfo()
Julia Version 0.3.0-rc3+8
Commit 1465486* (2014-08-12 06:46 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY SANDYBRIDGE)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@JeffBezanson
Member

That makes me think eval and/or macro expansion is the culprit, since that requires running the front-end's lowering passes. We know the parser is not this slow.

@timholy
Member

timholy commented Aug 12, 2014

It's hard to find a significant package that doesn't use any macros or call eval. But I just tested with HyperDualNumbers, which has just a single

@vectorize_1arg Real hyper

and yet has more than 300 lines of code.

The results are basically the same. If I hover over the bar corresponding to ./boot.jl, include: line 245:
[screenshot: reloadhyperdual]
and then work my way upward until it starts fragmenting into subfunctions, I get this call chain:

./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 531
/usr/local/julia/julia/src/ast.c, jl_parse_next: line 514
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761
/usr/local/julia/julia/src/flisp/flisp.c, _applyn: line 725
/usr/local/julia/julia/src/flisp/flisp.c, apply_cl: line 1892
/usr/local/julia/julia/src/flisp/flisp.c, do_trycatch: line 935

For the narrower stack immediately to the right, it starts this way:

./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 541
/usr/local/julia/julia/src/toplevel.c, jl_toplevel_eval_flex: line 455
/usr/local/julia/julia/src/ast.c, jl_expand: line 551
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761

and then continues in the same fashion (with the exact same lines).

Again, it's probably >90% of the total time, and this for a package with almost no macros.

@JeffBezanson
Member

HyperDualNumbers also takes very little total time to load. For a package that loads very quickly, it's possible most of the time is in the parser.

@vtjnash
Member

vtjnash commented Aug 12, 2014

I have some code around showing that parsing a few hundred simple comprehensions in a function takes a few milliseconds. Lowering those to Julia AST for sticking in the function table then takes several seconds.
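
Roughly the kind of measurement I mean (a sketch; assumes a test file comprehensions.jl full of simple comprehensions):

src = readall("comprehensions.jl")
exprs = Any[]
@time begin                        # parsing only
    pos = 1
    while pos <= length(src)
        ex, pos = parse(src, pos)
        ex === nothing && break
        push!(exprs, ex)
    end
end
@time for ex in exprs              # lowering only
    expand(ex)
end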

@timholy
Member

timholy commented Aug 12, 2014

Nope. Just created my own, super-exciting package (I should register this sucker 😉):

function makescript(filename)
    open(filename, "w") do file
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "function $funcname(x)\n    x^2\nend\n")
        end
    end
end
makescript("/tmp/testscript.jl")
@profile include("/tmp/testscript.jl")

Load time is 6.8 seconds. And the result is even clearer: looks like about 97% of the time is in flisp.c.

As I believe I once mentioned, I have seen an extremely rare bug in ProfileView, but in this case I've checked Profile.print(io, C=true) and gone through the resulting file in emacs. It confirms the ProfileView results.

@timholy
Member

timholy commented Aug 12, 2014

Man, if we just cache the lowered code, who cares about full precompilation?!?

@JeffBezanson
Member

Wow that is really shady. The front end must be doing something really stupid.

Fortunately, flisp has a profiler:

https://github.com/JeffBezanson/femtolisp/blob/master/tests/test.lsp#L219

@timholy
Member

timholy commented Aug 12, 2014

Well, of course these don't test actually using the code, so of course we'll want compilation too.

To test this, I modified my script to call each one of these 10000 functions once:

function makescript(filename)
    open(filename, "w") do file
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "function $funcname(x)\n    x^2\nend\n")
        end
        write(file, "x = 3.2\n")
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "$funcname(x)\n")
        end
    end
end

The load time went from 6.8 seconds to 18 seconds. However, lowering is still more than half the time (remember that calling the functions added another 10001 lines to the file). jl_compile and jl_generate_fptr account for most of the remainder, and inference is roughly 10%.

@timholy
Member

timholy commented Aug 12, 2014

Glad to hear about the flisp profiler! I'll have to leave this one up to you.

@jakebolewski
Member

Parsing performance is definitely not the problem.

julia> @time require("/tmp/testscript.jl")
elapsed time: 85.278999524 seconds (140911932 bytes allocated, 0.18% gc time)

julia> @time parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.522348909 seconds (8937864 bytes allocated)

julia> @time Parser.parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.121571927 seconds (66361048 bytes allocated, 9.37% gc time)

@JeffBezanson
Member

This discussion has been fruitful. Everybody, I am really sorry. I made an absurd mistake, which has been causing us to call lowering twice on many top-level expressions. I was mis-identifying which expressions needed to be expanded. Oh-so-usefully, the lowering pass is idempotent so I never noticed until I profiled it carefully.

454344f fixes it. For me, Gadfly now loads in 27 seconds instead of 48. Too embarrassed to celebrate :|

@jakebolewski
Member

It is kind of hard to evaluate parsing performance with only a reduced test case like @timholy's example above. Parsing performance varies dramatically between different input files. I guess this is only really testing one code path through the parser, but it looks like a big win. Congrats! It would be nice to have a reduced test case that really stresses the GC in flisp.

@timholy
Member

timholy commented Aug 14, 2014

That's great news!

@JeffBezanson
Member

Now I am running into bizarre failures and haven't been able to get lowering working yet.

@tkelman
Contributor

tkelman commented Aug 14, 2014

It's too bad speed.julialang.org has been dead for a couple months...

@IainNZ
Member

IainNZ commented Aug 14, 2014

How long do they take to run? Could I stick it on the end of PkgEval's nightly run? I'm already building master's last good build nightly anyway.

@tkelman
Contributor

tkelman commented Aug 14, 2014

Probably a good idea to integrate the systems. I think it's just the perf tests, hopefully not that time consuming. Some value in comparing say accelerate to openblas on Mac, but even resurrecting the trendlines for the default Linux configuration would be nice.

@staticfloat
Member

@tkelman resurrecting speed.julialang.org into something more modern is on my TODO list. It's been dead because the machines the perf tests were running on are no longer doing that, and the process I had for building nightly versions was kind of janky. I've set up a buildbot instance which is MUCH better suited to the task, and have been getting everything set up for the last week or so. My dream is to have a setup where continuous testing, builds, and speed tests can all happen in a flexible framework. Right now I have OSX build/test/deploy and Ubuntu build/test/deploy working, along with OSX bottle building for Homebrew.jl.

@JeffBezanson
Member

I'm now able to both parse and lower the 25000 function file. I don't think I know chicken well enough to come to a conclusion yet, but so far chicken with default settings takes 19 seconds, and flisp takes 15. Chicken reported 80% GC time, so I tried giving it a 256MB heap, and this reduced its run time to 5.3sec. flisp can play the same space-vs-time game, and with the same heap size took 11.7sec.

It seems Chicken's GC is absurdly sensitive to heap size, but with a big enough heap it can be 2-3x faster than flisp.

A further problem is that AFAICT setting a heap size in chicken also fixes the heap size so it can't grow, and it aborts if it runs out of space. This is not ok. They really need a different heap growth policy, and with that their default performance could be much better.

I tried the -nursery option which controls gen0 size, but it caused a segfault.

Chicken is different from both flisp and julia in that all top-level expressions run at run time, not at compile time. I don't know how much this affects performance, but I feel it leads to some strange behaviors. I have some top-level expressions that call eval, and the code compiles fine but then gives a run-time error due to an unsupported run-time use of eval. It's possible to fix this using eval-when, but now I'm a bit more certain we made the right choice in julia.

@StefanKarpinski
Member Author

Jeez. 256MB heap? That seems absurdly large.

@jakebolewski
Member

Lisp generates a lot of garbage :-)


@StefanKarpinski
Member Author

The "port the parser to Julia" option is looking better and better all the time.

@jakebolewski
Member

@JeffBezanson I guess the other one to try would be Gambit. Have you pushed all your latest changes? It might also make sense to reach out to the Chicken Scheme community. Schemers seem to be just as fanatical as Julians when it comes to optimizing code for their favorite Scheme implementations.

@JeffBezanson
Member

Turned out 128MB was nearly as good. 64MB worked but was way slower. In any case, that kind of thing is a machine's job :)

@JeffBezanson
Member

Ok, all changes pushed to jb/r5rs. In the current state, the code is actually significantly slower in flisp than on master. Probably because of indirection through compatibility aliases for key procedures.

@JeffBezanson
Member

@jakebolewski yes we should do that. They might have more GC tuning tricks, or can perhaps even fix things for us.

@jakebolewski
Member

Ok, I've been playing with this long enough. Here is a quick guide to testing this out:

  • Check out @JeffBezanson's jb/r5rs branch
  • Download / install Chicken Scheme (Chicken is bootstrapped, so to build Chicken you need a version of Chicken already installed)
  • compile the r5rs-compat.scm file in the src directory with the Chicken compiler:
csc <compiler options> -dynamic r5rs-compat.scm
  • define a test file
(load "r5rs-compat.so")
(define (profile-frontend fname)
  (let* ((ast (time (jl-parse-file fname))))
    (time (julia-expand ast)))
  #t)
(profile-frontend "/tmp/torture.jl")
  • run the test file in the interpreter
julia/src [a865ca2●] » csi test.scm

CHICKEN
(c) 2008-2014, The Chicken Team
(c) 2000-2007, Felix L. Winkelmann
Version 4.9.0.1 (stability/4.9.0) (rev 8b3189b)
linux-unix-gnu-x86-64 [ 64bit manyargs dload ptables ]
bootstrapped 2014-06-07

; loading test.scm ...
; loading r5rs-compat.so ...
; loading /usr/lib/chicken/7/srfi-60.so ...
2.56s CPU time, 0.566s GC time (major), 2058936 mutations, 28/7289 GCs (major/minor)
30.91s CPU time, 25.967s GC time (major), 28380143 mutations, 182/25774 GCs (major/minor)

@JeffBezanson which generated file were you using exactly? (many versions have been posted to date)

@jakebolewski
Member

Are you using the latest from-source build? Out of the box the performance was pretty good, a solid 2x. With a little tuning I'm getting:

julia/src [jcb/scheme●] » csc -O4 -u -lfa2 -inline  r5rs-compat.scm
...
[jcb/scheme●] » csi test.scm                                            

CHICKEN
(c) 2008-2014, The Chicken Team
(c) 2000-2007, Felix L. Winkelmann
Version 4.9.0.1 (stability/4.9.0) (rev 8b3189b)
linux-unix-gnu-x86-64 [ 64bit manyargs dload ptables ]
bootstrapped 2014-06-07

; loading test.scm ...
; loading r5rs-compat.so ...
; loading /usr/lib/chicken/7/srfi-60.so ...

1.71s CPU time, 0.34s GC time (major), 1713936 mutations, 19/5826 GCs (major/minor)
8.627s CPU time, 4.606s GC time (major), 23650143 mutations, 37/20337 GCs (major/minor)

# master version
julia/src [jcb/scheme●] » flisp/flisp test2.scm                                   
Elapsed time: 3.215708017349243 seconds
Elapsed time: 24.24796199798584 seconds

This is with the torture test, which I'm finding hits the GC fairly hard.

@JeffBezanson
Member

I hadn't tried building a .so, but with that I sometimes get

; loading main.scm ...
; loading r5rs-compat.so ...
[panic] `##sys#error-hook' is not defined - the `library' unit was probably not linked with this executable - execution terminated

I'm using the v4.9.0.1 Arch package.

What test file are you parsing there?

@jakebolewski
Member

I haven't run into that error. You can try enabling different default compilers / linkers with the csc switches.
I'm using the flisp test I posted above with 25000 functions. I'm having a hard time finding examples that can pass through both the parser and lowering at the moment. Most of the time I just get an infinite loop. I'm also finding out that it's not good to compile with the -u option; it seems to disable all signal handling, so you can't SIGTERM the process.

@JeffBezanson
Member

Yes, quite a few more code changes are needed.

@jakebolewski
Member

Enough to kick the tires. This seems like it could be a 2x+ speedup. It's also a bit crazy that your 2000-line interpreted flisp is only 2x slower than an optimizing compiler developed over 15 years.

@ViralBShah
Member

That is pretty crazy. As @stevengj has pointed out, with a decent compiler and good choice of algorithms, it is possible to get to within 2-4x of the performance of many numerical libraries too.

@timholy
Member

timholy commented Aug 15, 2014

This is a pretty encouraging result. No matter what path gets chosen, it proves that it's possible to do 2-3x better, at least on trivial code. Moreover, if there's some opportunity to combine multiple passes through the AST into a single pass, I imagine there might be even more savings possible.

But it does indeed emphasize the impressive performance/size ratio of flisp!

@JeffBezanson
Member

I tried adding #defines to flisp to disable all of its run time checks, but interestingly this gave no performance gain for me. I guess modern compilers and CPUs do amazing things.

@timholy
Member

timholy commented Aug 16, 2014

I've been pretty amazed at how often branches are basically free. My favorite example is that our reduce-across-dims code conceptually now works like this (example assuming a matrix as an input):

function sum!(R, A)
    for i_2 = 1:size(A,2)
        j_2 = size(R,2) == 1 ? 1 : i_2
        for i_1 = 1:size(A,1)
            j_1 = size(R,1) == 1 ? 1 : i_1
            R[j_1,j_2] += A[i_1,i_2]
        end
    end
end

Two years ago I would have said that the first thing to do was get that branch out of the inner loop. It actually turns out to have a performance impact of at most a few percent. And boy does it make it easy to write cache-friendly reduction algorithms (which, compared to non-cache-friendly algorithms, improve performance many-fold).
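
For concreteness, here's how the sketch above gets used; R has to start out zeroed since the loop accumulates, and at the REPL you may want to rename the function to avoid clashing with Base.sum!:

A = rand(5, 4)
R = zeros(1, 4)    # size(R,1) == 1, so dimension 1 is reduced
sum!(R, A)         # column sums of A accumulate into R
R2 = zeros(5, 1)   # size(R2,2) == 1, so dimension 2 is reduced
sum!(R2, A)        # row sums of A accumulate into R2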

@vtjnash
Member

vtjnash commented Sep 30, 2014

I've been playing with various levels of caching to see what speed advantages they might give:
https://github.com/vtjnash/Speed.jl

One thing I noticed is that the compiler seems to be too greedy about what it expands during eval. For example, the number of garbage objects generated from executing x = 1+1 in global scope seems a bit excessive:

julia> expand(parse("""
              x = 1+1
              """))
:($(Expr(:thunk, AST(:($(Expr(:lambda, {}, {{symbol("#s2")},{{symbol("#s2"),:Any,18}},{}}, :(begin 
        #s2 = 1 + 1
        x = #s2
        return #s2
    end))))))))

I think method lowering is still the most expensive step (behind macro evaluation, in many cases), but it seems that Julia should be able to directly eval more expressions as well.

@timholy
Member

timholy commented Sep 30, 2014

I guess improving this would help with code that generates code in loops, right? As you know from discussions over at Speed.jl, I've definitely found that to be a major factor in slow load times.

@JeffBezanson
Member

Hmmm, at some point in the past I think the front end was more economical with simple expressions like x = 1+1. We should try to bring that back.
