Compilation bottleneck #14556

Closed
timholy opened this issue Jan 4, 2016 · 37 comments
Labels
performance Must go faster

Comments

@timholy
Member

timholy commented Jan 4, 2016

I don't understand the compile toolchain very well, and I haven't yet gotten good profile info (see https://groups.google.com/d/msg/julia-dev/RKAahOwppNs/Kg0TFx_SBwAJ), but at least for the SubArray tests I'm guessing that this line, via jl_compress_ast, is an enormous bottleneck in compilation. I inserted a debugging line,

diff --git a/src/dump.c b/src/dump.c
index e11bff5..9c3c96f 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -618,6 +618,7 @@ static int is_ast_node(jl_value_t *v)

 static int literal_val_id(jl_value_t *v)
 {
+    jl_printf(JL_STDOUT, "len = %d\n", jl_array_len(tree_literal_values));
     for(int i=0; i < jl_array_len(tree_literal_values); i++) {
         if (jl_egal(jl_cellref(tree_literal_values,i), v))
             return i;

and saw that most of the lengths were around 10k or so. That's a heck of a lot of calls to jl_egal. Is there anything that can be done to reduce this number?

@vtjnash
Member

vtjnash commented Jan 4, 2016

we could put them in an ObjectIdDict, but that may make it slower in the expected usage. that table is supposed to be nearly empty, but it can get populated if code is interpolating non-ast objects into the ast.

perhaps we should switch to using a per-linfo list of literals? the list could re-use the codegen roots list -- it seems likely that these same objects are getting copied there by emit_linfo_root during codegen anyway.
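
For illustration only, a minimal sketch of the hash-lookup idea, assuming jl_object_id can serve as the structural hash paired with jl_egal; the bucket type, table, and function names below are hypothetical (not existing internals), and GC rooting of the stored values is elided:

#include "julia.h"

#define N_BUCKETS 1024   /* hypothetical table size */

/* Hypothetical bucket type; not an existing Julia structure. */
typedef struct literal_bucket_t {
    jl_value_t *val;                  /* the interned literal */
    int id;                           /* its index in tree_literal_values */
    struct literal_bucket_t *next;
} literal_bucket_t;

static literal_bucket_t *literal_table[N_BUCKETS];

/* Hash first with jl_object_id; jl_egal only runs on candidates that landed
   in the same bucket, so the linear scan over ~10k entries goes away. */
static int lookup_literal(jl_value_t *v)
{
    size_t h = jl_object_id(v) % N_BUCKETS;
    for (literal_bucket_t *b = literal_table[h]; b != NULL; b = b->next) {
        if (jl_egal(b->val, v))
            return b->id;
    }
    return -1;   /* not found: the caller appends v and inserts a bucket here */
}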

@JeffBezanson
Member

At one point we switched to per-module literal lists since many values ended up being shared by several functions. But I agree it seems silly to have both that and linfo->roots.

10k is way more than I'd expect. I wonder if we can cut that down (e.g. by serializing more things). Or just fully switch to a bytecode representation.

@timholy
Member (Author)

timholy commented Jan 4, 2016

It will be a little while, but to save anyone else time, I'm collecting the histogram of counts by running make testall1 with

diff --git a/src/dump.c b/src/dump.c
index e11bff5..50ab040 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -6,6 +6,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <assert.h>
+#include <stdio.h>

 #include "julia.h"
 #include "julia_internal.h"
@@ -616,8 +617,21 @@ static int is_ast_node(jl_value_t *v)
         jl_is_labelnode(v) || jl_is_linenode(v) || jl_is_globalref(v);
 }

+int tlvlen[50000] = { 0 };
+
+JL_DLLEXPORT void jl_write_tlvlen(void)
+{
+    FILE *f = fopen("/tmp/tlvlen.bin", "w+");
+    fwrite(tlvlen, sizeof(int), 50000, f);
+    fclose(f);
+}
+
 static int literal_val_id(jl_value_t *v)
 {
+    int n = jl_array_len(tree_literal_values);
+    if (n >= 50000)
+        n = 49999;
+    tlvlen[n] += 1;
     for(int i=0; i < jl_array_len(tree_literal_values); i++) {
         if (jl_egal(jl_cellref(tree_literal_values,i), v))
             return i;
diff --git a/src/julia.h b/src/julia.h
index b93c268..03545fc 100644
--- a/src/julia.h
+++ b/src/julia.h
@@ -1678,6 +1678,8 @@ JL_DLLEXPORT extern const char* jl_ver_string(void);
 JL_DLLEXPORT const char *jl_git_branch(void);
 JL_DLLEXPORT const char *jl_git_commit(void);

+    JL_DLLEXPORT void jl_write_tlvlen(void);
+
 // nullable struct representations
 typedef struct {
     uint8_t isnull;
diff --git a/test/runtests.jl b/test/runtests.jl
index 969c72e..6b7cb00 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -72,3 +72,5 @@ cd(dirname(@__FILE__)) do
     @unix_only n > 1 && rmprocs(workers(), waitfor=5.0)
     println("    \033[32;1mSUCCESS\033[0m")
 end
+
+ccall(:jl_write_tlvlen, Void, ())

Will report back when it finishes.

@JeffBezanson
Member

It might also make sense to count the total iterations spent in that loop (# of jl_egal calls). Or maybe you have that already?
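
For example, a hedged sketch in the spirit of the instrumentation above; the counter name and how it gets dumped are invented, and the unchanged tail of the function is elided:

/* Hypothetical instrumentation: tally total jl_egal comparisons across all
   literal_val_id calls (could be written out the same way tlvlen is above). */
long egal_calls_total = 0;

static int literal_val_id(jl_value_t *v)
{
    for (int i = 0; i < jl_array_len(tree_literal_values); i++) {
        egal_calls_total += 1;   /* one jl_egal call per loop iteration */
        if (jl_egal(jl_cellref(tree_literal_values, i), v))
            return i;
    }
    /* ... unchanged remainder: append v and return its new index ... */
}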

@timholy
Member (Author)

timholy commented Jan 4, 2016

Yeah, I thought of that after I launched the first one. FWIW here are the results:
[histogram of tree_literal_values list lengths]

Median value was 10, mean value was 254.

Want me to collect the # of jl_egal calls too?

@timholy
Member (Author)

timholy commented Jan 4, 2016

Wait, I was totally borked on those statistics. Median value is 10024, mean value is 11305.

@timholy
Member (Author)

timholy commented Jan 4, 2016

Plotting the cumulative histogram on a linear scale suggests there's one group of "short lists" (<300) and another with "long lists" (between about 7k and 30k):
[cumulative histogram of list lengths, linear scale]

@timholy
Member (Author)

timholy commented Jan 4, 2016

If you just count jl_egal calls, it's a radically different picture:
[histogram of jl_egal call counts per lookup]

Median value 128, mean value 1585.

@timholy
Member (Author)

timholy commented Jan 4, 2016

I decided to profile the test suite. I had to exclude a few tests ("profile", because it calls Profile.clear(); also "cmdlineargs", "file", "libdl", and "parallel", because they gave problems for reasons I didn't track down). I ran this on release-0.4 (0.4.3-pre6), because I don't trust profiling on 0.5 due to #14281. I collected one sample every 10 ms (100 samples/sec).

Analysis script (whose purpose was to show "flat" results that don't suffer from the duplicates-due-to-recursion problem as illustrated by dumbsum) and results are here: https://gist.github.com/timholy/67762c65cbfe4226b43a. I only included lines that accounted for at least 100 samples, i.e., at least 1s of time. Scroll to the end to see the worst offenders.

While it's a shame that so many important lines lack good backtrace info (here's hoping that @Keno's work will help), to me it looks as if jl_egal is the single worst "low-level" offender, accounting for more than 8% of the total number of samples collected. That, combined with the fact that the number of jl_egal calls is clearly not as small as seems to have been expected, suggests there's an opportunity for measurable improvement.

@Keno
Member

Keno commented Jan 4, 2016

Unfortunately, I'm probably not as much help here. I briefly implemented a more efficient/accurate profiler on top of Gallium, but it only worked on OS X and had a number of bugs that made it essentially unusable. That said, I do plan to revisit the topic of profiling once I've made some progress on the debugger front.

I can say, however, that I concur with your analysis of jl_egal being a significant chunk of our compile time. I don't have the traces anymore, but I remember pointing this fact out to @JeffBezanson.

@timholy
Member (Author)

timholy commented Jan 4, 2016

Bummer about the backtraces. Hopefully we'll get there some day. I also just remembered that long ago I convinced myself (rightly or wrongly) that most of the borked backtraces are almost surely from our many C functions that are declared static---they're not exported in the .so file, and as a consequence our means of looking them up fail.

@Keno
Member

Keno commented Jan 4, 2016

Well, the backtraces shouldn't be super borked. What OS are you on?

@timholy
Member (Author)

timholy commented Jan 4, 2016

See more info here.

@timholy
Member (Author)

timholy commented Jan 4, 2016

Kubuntu 14.04.3 LTS, Trusty Tahr

@Keno
Member

Keno commented Jan 4, 2016

Hmm, should be fine then. On OS X you have to jump through some hoops to make the dSYM available such that we can find it to get at the debug line info.

@timholy
Member (Author)

timholy commented Jan 4, 2016

This affects every computer I've tested that's running Kubuntu. I seem to remember noticing a CentOS machine did better, however (can't test anymore, that machine was wiped clean and replaced with Kubuntu).

I could try to set up an account for you on my machine---it won't be trivial because here they're pretty draconian about the firewall, but I suspect I can find a way.

@KristofferC
Member

I did a full run of the tests (on latest master) with VTune XE 2016. If anyone wants to download the traces, I put them up here (80 MB file; maybe I should have used a longer trace interval): https://dl.dropboxusercontent.com/u/14340581/julia_tests_vtune.tar.gz. Note that if too many people download it, Dropbox will likely disable the download.

Here are a few screenshots:

Bottom up tree, jl_egal stands out a lot:
[VTune screenshot: bottom-up call tree]

Stuff getting called from jl_egal:
[VTune screenshot: callees of jl_egal]

@timholy
Member (Author)

timholy commented Jan 4, 2016

I get a 403 from the last two links---I didn't try the first link.

But that's what I was guessing from your email to julia-dev (I take it you fixed the line numbers?). Does that mean that 11.2% of the total test-suite time is that one line?? Surprising that the line above it doesn't take any appreciable time, given that this is a pointer comparison and the line above it also calls jl_typeof. Or perhaps things are being reordered, and the cost of both jl_typeof calls is being credited to that line?

Also, can you tell how many calls are on each line? If it usually exits at line 294, that might explain why the later lines aren't so expensive despite the fact that they involve more operations.

Another oddity is that the fact that this appears at all in the backtraces leads me to guess that jl_egal is not inlined despite Ulrich's careful work in #9732. CC @drepper.

@KristofferC
Member

Updated screenshot links, do they work now?

@timholy
Member (Author)

timholy commented Jan 4, 2016

Nope.

@vchuravy
Member

vchuravy commented Jan 4, 2016

@timholy After refreshing they worked for me.

@timholy
Member (Author)

timholy commented Jan 4, 2016

Thanks, @vchuravy. Wow, that is a standout.

@KristofferC
Member

I'm confused... does the screenshot below show that one single instruction takes 11% of the total CPU time of the whole test suite?

[VTune screenshot: instruction-level view of jl_egal]

@timholy
Member (Author)

timholy commented Jan 4, 2016

CC @ArchRobison, who I suspect is the local expert on VTune.

@drepper
Contributor

drepper commented Jan 4, 2016


This is statistical sampling. The profile tick might hit the same instruction over and over (in this case probably because of cache misses), but what it really means is that the enclosing basic block, as part of a larger construct like a function, accounts for something on the order of 11% of the total time.

I am actually looking at profiles myself right now and will see what I can find. FWIW, I see jl_egal at the top of the list myself when using perf:

Overhead  Command  Shared Object     Symbol
  7.86%   julia    libjulia.so       [.] jl_egal
  5.71%   julia    libjulia.so       [.] jl_method_table_assoc_exact
  3.76%   julia    libjulia.so       [.] jl_subtype_le
  2.62%   julia    libjulia.so       [.] jl_has_typevars__.constprop.49
  2.32%   julia    libc-2.22.so      [.] int_malloc
  2.11%   julia    libjulia.so       [.] llvm::DenseMapBase<llvm::DenseMap<unsigned int, llvm::SmallVector<llvm::Re
  1.68%   julia    libjulia.so       [.] lookup_type_idx.isra.21
  1.67%   julia    libjulia.so       [.] jl_apply_generic
  1.42%   julia    libc-2.22.so      [.] int_free
  1.27%   julia    libc-2.22.so      [.] malloc
  1.16%   julia    libjulia.so       [.] jl_type_intersection_matching
  1.14%   julia    libjulia.so       [.] jl_serialize_value
  1.08%   julia    libjulia.so       [.] jl_tuple_subtype.isra.8
  0.92%   julia    libjulia.so       [.] llvm::RuntimeDyldImpl::resolveRelocations
  0.74%   julia    libjulia.so       [.] ptrhash_get
  0.62%   julia    [kernel.vmlinux]  [k] sys_mprotect
  0.61%   julia    libjulia.so       [.] gc_sweep_inc
  0.61%   julia    libjulia.so       [.] jl_f_get_field
  0.60%   julia    libjulia.so       [.] jl_type_intersect.constprop.37
  0.56%   julia    libjulia.so       [.] jl_f_subtype
  0.55%   julia    libjulia.so       [.] jl_get_binding_
  0.53%   julia    libjulia.so       [.] jl_eqtable_get
If the issue is similar to last time it should be easy to figure out.

@KristofferC
Member

Thanks for the explanation. Also, good that two different tools seem to agree.

@drepper
Contributor

drepper commented Jan 5, 2016

There was indeed, once again, one function call in jl_egal which caused a stack frame to be created. With the following change this can be avoided, and it shaves around 0.5% off the total cost of jl_egal. The function is still at the top, but this is nothing to sneeze at.

I'll prepare a formal pull request in a bit, but first I want to see whether more can be done. Has there ever been a discussion about interposition of symbols at runtime? Given that Julia runs on multiple platforms, this is not something that can be relied on for the project as a whole. Without restricting interposition, functions like jl_egal can never get inlined, since the compiler mustn't assume it knows what the definition looks like. If interposition is restricted, it opens up all kinds of new optimizations. The source code that ends up in libjulia will need more annotation, though, and for maximum effect I expect more complex Makefile rules will be needed.

The question is: do people think it's time for these kinds of optimizations already? Think about premature optimization being evil, etc. If the C/C++ portion is declared sufficiently stable, and people won't be grossed out by some additional ugliness in the source code and perhaps in the Makefile, I think I can provide some patches that will gain quite a bit of performance. Of course I'd have concrete performance numbers before a pull request, but I want to get an impression of how such changes would be received.

diff --git a/src/builtins.c b/src/builtins.c
index d50f7ce..be52583 100644
--- a/src/builtins.c
+++ b/src/builtins.c
@@ -210,7 +210,7 @@ JL_DLLEXPORT void jl_pop_handler(int n)
 
 // primitives
 
-static int bits_equal(void *a, void *b, int sz)
+static int NOINLINE bits_equal(void *a, void *b, int sz)
 {
     switch (sz) {
     case 1:  return *(int8_t*)a == *(int8_t*)b;


@tkelman
Contributor

tkelman commented Jan 5, 2016

Guess it depends just how ugly, but there are already a fair number of platform-specific pieces in the C runtime, so I'd think platform-specific optimizations would be fair game as long as the changes are maintainable going forward.

@eschnett
Contributor

eschnett commented Jan 5, 2016

I used Instruments on OS X to profile Julia running Base.runtests("int"):
[Instruments screenshot: profile sorted by L2 cache misses, jl_egal at the top]

The output is sorted by L2 cache misses, and the recursive calls to jl_egal are the top hit. Calls to jl_egal from literal_val_id are also expensive, and literal_val_id seems to be the main caller of jl_egal. The most expensive line of code I'm seeing is the first typeof call, i.e. the same basic block mentioned above. This confirms @drepper's suggestion.

As an outsider I'm wondering: There is a function that compares types for equality, but no function for a "less than" relation or hash function? Does this mean that types are stored mostly in lists and arrays, and not in trees or heaps or hash tables?

@timholy
Member (Author)

timholy commented Jan 5, 2016

but no function for a "less than" relation or hash function?

See jl_subtype_le, and indeed much of jltypes.c.

If you read from the top of this issue, you'll see that the number of calls to jl_egal greatly exceeds expectations. The best approach (aside from making jl_egal itself faster, which I fully support) is to call it fewer times. That might be achieved by restructuring the things that lie upstream, or by using a hash table.

@JeffBezanson
Member

@eschnett There is jl_object_id, which is a hash function that pairs with jl_egal. However for types these are structural, and do not implement type equality (in the sense of A<:B && B<:A). The type cache in jltypes.c needs to look up types, and it does this by imposing an arbitrary total ordering on a (commonly-occurring) subset of types. Extending this to all types appears to be a hard problem. It would basically require canonicalizing types, which might be harder than subtyping itself (both in the sense of programming effort and computational complexity)!
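
As a concrete illustration of that pairing (a sketch only; check_id_contract is an invented helper, not part of the Julia C API):

#include <assert.h>
#include "julia.h"

/* jl_object_id is the structural hash that pairs with jl_egal: values that
   are egal must hash identically, which is what lets a table keyed on
   jl_object_id fall back to jl_egal only for same-bucket candidates.
   Neither function implies type equality in the A<:B && B<:A sense. */
static void check_id_contract(jl_value_t *a, jl_value_t *b)
{
    if (jl_egal(a, b))
        assert(jl_object_id(a) == jl_object_id(b));
}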

@JeffBezanson
Member

@drepper Yes jl_egal should be quite stable now and I think we can handle some ugliness there.

@drepper
Contributor

drepper commented Jan 5, 2016


The ugliness will spread, unfortunately. The jl_egal function itself cannot be optimized much more; I have a few changes, but unless something can be changed functionally I fear there is not much to be done. The bigger changes I am thinking about would hopefully allow jl_egal to be inlined at all its call sites. The function is called so often that the cost of the call itself, with the parameter setup and register restoration, is significant. This requires (at least on ELF systems) that exported functions are also known under aliases, and that those aliases are hidden. Then whole-program optimization or LTO can inline the function. That's really the only way I can see to get a significant speedup.
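
For reference, a standalone sketch of the ELF pattern being described, using toy names rather than the real jl_egal (an illustration, not the proposed patch): the real definition gets hidden visibility so intra-library calls bind directly and can be inlined under LTO, while the exported, interposable name is only an alias.

/* alias_sketch.c -- hypothetical example of restricting interposition on ELF */

typedef struct value value_t;   /* stand-in for jl_value_t */

/* Hidden definition: calls from inside the shared library bind here
   directly, so they cannot be interposed and can be inlined under LTO. */
__attribute__((visibility("hidden")))
int egal_internal(value_t *a, value_t *b)
{
    return a == b;   /* toy comparison; the real one is structural */
}

/* Public, interposable name exported from the .so: just an alias. */
extern int egal(value_t *a, value_t *b)
    __attribute__((alias("egal_internal")));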

@vtjnash
Member

vtjnash commented Jan 5, 2016

Would that be equivalent to making it a static inline function in julia.h?

@JeffBezanson
Member

I agree with restricting interposition for jl_egal; replacing it would not be a good idea anyway. However it's also true that the bigger problem is us just calling it too often.

@JeffBezanson
Member

#14656 might fix this.

@timholy
Member (Author)

timholy commented Jan 14, 2016

Exciting! Looking forward to testing it. Thanks so much for tackling this.
