Qi Compiler Sync Sept 23 2022

Lessons from Analyzing Require Latency

Qi Compiler Sync Sept 23 2022

Adjacent meetings: Previous | Up | Next

Summary

We reviewed our findings on the subject of require latency (that is, the cost of (require module), in general), and came to conclusions on:

How important is this performance metric?
As an indicator of actual performance, how reliable is it?
How to use ecosystem tools to probe require latency.
The evolution of load times vs runtime for Racket 6.X to present.
What is the contribution of loading vs instantiation typically?
Understanding pathological cases (e.g. across the typed/untyped boundary).
Is load time dominated by amount of code or number of files?
Are high load times a problem?

Note: For readers who didn't attend and are interested in the topic, feel free to skim or skip sections below that aren't relevant for your interests. Relevant sections are crosslinked so you should be able to find what you missed if you skipped it.

Background

In recent times, Qi has been shedding dependencies in the quest for maximum performance, both towards lowering build times as well as load times, and of course, the goal of the compiler project is to improve performance at runtime.

Yet at the same time, Qi recently introduced a dependency on the bindingspec library that appeared to have increased the load time by 37%. Was this something to be concerned about? This time around, we settled on some answers (spoiler: no).

Tools

Last time, we used a number of tools to probe the require latency. In the interim, there were some developments:

Bogdan added support in the Racket import profiler for a -P flag to indicate the maximum phase dependencies to consider. E.g.

$ raco profile-imports -l -P 0 qi

... profiles only phase 0 imports and only the "leaf" nodes in the dependency graph, and sorts the output in descending order of proportion of time contributed.

Michael discovered that the majority of the latency contribution from bindingspec came from loading rather than instantiating the module, and additionally that require latency varied significantly as a function of already-loaded modules (e.g. racket/base vs racket).
Sid improved the require-latency tool to use a more tractable method of measurement, and also to support being able to specify a "modulus" -- that is, a set of modules to be loaded beforehand that should be factored out in estimating the remainder of the require latency contributed by a particular module. E.g.

$ raco require-latency -m racket qi

Michael also shared this way to output all files loaded by a module after taking a modulus into account:

racket -l racket/base -l qi -e "(let ([old (current-load/use-compiled)]) (current-load/use-compiled (lambda (p n) (displayln p) (old p n))))" -e "(time (dynamic-require 'typed-stack 0))"

Learnings and Findings

Loading vs Instantiating

Modules required at any phase are always loaded. This is because (as we learned from Michael who asked Matthias who asked Matthew) higher-phase dependencies could always reduce the phase using (require (for-template)), and it would be necessary to support any module load-time side effects from that. As a result, the compiler cannot avoiding loading modules even if they are required at a higher phase level (e.g. for-syntax).

Only modules required at phase 0 (i.e. simply as (require module)) are instantiated.

Pathological Cases

Typed Stack?

In the lead-up to the Qi 3.0 release, the require latency from (require qi) was reduced by a factor of 3 (~330 ms). The gain came predominantly from removing the typed-stack dependency, which is a perfectly reasonable implementation of the stack data structure in (typed) Racket. We wanted to understand why it contributed a high latency. By using the tools above, we were able to determine that the cause is that (require typed/racket) doesn't overlap much with regular (require racket). As a result, the majority of imports from typed racket are not shared with untyped dependencies used by untyped client libraries. In other words, using dependencies across the typed/untyped boundary comes with a standard cost that would be "amortized" if there were multiple typed modules being depended upon, but in this case as there was only one, the cost was reflected entirely in a single dependency -- and this cost doesn't have to do with the specific dependency (typed-stack) but with requiring typed modules in untyped code.

"Report Benchmarks" module

The module qi-sdk/profile/report.rkt has a require latency of ~1800ms. This is very much an outlier, as for most Racket libraries the latency varies from 100ms to 500ms (on the same hardware). We found that about half of this latency was contributed by the math dependency, which, once again, is a typed module. Still, we weren't sure what the remainder of the time was attributable to. One theory was that as the file contains numerous lines like these (probably about 50-60 such lines):

(require
 (prefix-in one-of?: (submod "forms.rkt" one-of?))
 (prefix-in and: (submod "forms.rkt" and))
 (prefix-in or: (submod "forms.rkt" or))
 ...

... that, possibly, even though the submodules here are all contained in the same file (forms.rkt), that the module forms.rkt was being loaded afresh each time. The output from the current load handler showed forms.rkt appear dozens of times, but it seemed plausible that, nevertheless, Racket would use a cached version at the next step rather than load from disk each time.

So, for this example, it could be the dependencies, or it could be (in principle avoidable, optimizable) load times, or perhaps something else.

Is Load Time Dominated by Amount of Code or Number of Files?

We had discovered already that the latency contribution from bindingspec was predominantly from loading the module rather than instantiating it. This was determined by requiring it for-syntax without using it, and finding that the latency contribution wasn't much different than in code that required it at phase 0 and used it, too.

Extrapolating from this, we can roughly say that:

In the most common cases, load time dominates instantiation time.

Given this, we wanted to understand whether this cost from load time was dominated by the amount of code loaded from disk, or the number of files in which the code was contained.

"Code Loader" Test Application

To investigate this, we created a test application that generates lots of code, and puts the generated functions into either (a) individual modules each defining and providing one large function, or (b) a combined module defining and providing all of the same functions. The goal was to compare the require latency for the separate vs the combined case via comparing two main modules, main-combined.rkt and main-separate.rkt that require either the combined module or the separate modules individually.

After compiling the generated modules using raco make output/*.rkt, we used require-latency to measure the times taken to require each of the main modules. With 100 generated modules each containing a 1000+ line function resembling:

  (define (name . args)
    ;; we use displayln since, as a side effect, it won't be
    ;; optimized away by the compiler.
    (displayln "hi")
    (displayln "hi")
    (displayln "hi")
    ... 1000+ ...
  )

... (so that the combined module contains ~100,000 lines) the results come out to:

$ raco require-latency -f output/main-combined.rkt
2 ms
$ raco require-latency -f output/main-separate.rkt
60 ms

We wanted to verify that the compiled code actually preserved the long functions and didn't shrink them to trivial implementations, and for this purpose we used the PLT_LINKLET_SHOW_CP0 environment variable, which, when set, dumps the output of each compilation stage to the console:

$ export PLT_LINKLET_SHOW_CP0=1
$ cd output
$ rm compiled/mod0*  # delete the existing compiled output just to make sure
$ racket mod0.rkt 
;; linklet ---------------------
(linklet
  ((.get-syntax-literal!) (.set-transformer!) (configure)) ()
  (void) (configure #f) (void))
;; schemified ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 configure3)
  (print-as-expression #t)
  '#<void>)
;; cp0 ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 configure3)
  ((#3%$top-level-value '1/print-as-expression) #t)
  (#2%void))
;; compiled ---------------------
done
;; linklet ---------------------
(linklet ((.get-syntax-literal!) (.set-transformer!) (displayln))
  (mod0) (void)
  (define-values (mod0)
    (#%name
      mod0
      (lambda args_1
        (begin
          (displayln "hi")
          (displayln "hi")
          (displayln "hi")

          ... lots of lines ...

          (displayln "hi")
          (displayln "hi")
          (displayln "hi")))))
  (void))
;; schemified ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 displayln3 mod04)
  (define mod0
    (#%name
      mod0
      (lambda args_1
        (begin
          (displayln3 "hi")
          (displayln3 "hi")
          (displayln3 "hi")

          ... lots of lines ...

          (displayln3 "hi")
          (displayln3 "hi")
          (displayln3 "hi")))))
  (variable-set!/define mod04 mod0 'consistent))
;; cp0 ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 displayln3 mod04)
  (letrec ([mod0 (lambda args_1
                   (displayln3 "hi")
                   (displayln3 "hi")
                   (displayln3 "hi")

                   ... lots of lines ...

                   (displayln3 "hi")
                   (displayln3 "hi")
                   (displayln3 "hi"))])
    (variable-set!/define mod04 mod0 'consistent)))
;; compiled ---------------------
done

... which showed that the compiled code was large, as intended.

So, that seemed to validate that the measured times were legitimate indicators, and thus, support that:

Load time is dominated by number of files rather than just the amount of code.

Require Latency Cannot be Considered in Isolation

Ultimately, the most relevant finding for our purposes was that require latency cannot be considered in isolation, since the contribution of a library to the require latency of a project is a function of the other dependencies of the project. Using the new "modulus" argument in the require-latency raco tool, which pre-loads the specified modules before measuring require latency of the target module, we see this clearly:

$ raco require-latency bindingspec
176 ms
$ raco require-latency -m racket bindingspec
19 ms

That is, in a project using #lang racket or (require racket) -- as opposed to racket/base -- the latency contributed by bindingspec is almost negligible.

Using a pre-bindingspec branch of Qi for comparison, we see the naive overhead on top of Qi:

$ raco require-latency qi
175 ms
$ raco require-latency -m qi bindingspec
33 ms

... which decreases as a function of other dependencies in the project:

$ raco require-latency -m qi -m racket/set bindingspec
22 ms

... and so on.

Therefore, require latency measured in isolation should be taken with a grain of salt.

About Load Time

The Evolution of Load Times Across Racket Versions

In general, since Racket 6.X, through the incorporation of various components of Racket's implementation into Racket itself (instead of C), including scope sets (v6.3), the expander (7.0), and Chez Scheme (8.0), the general trend has been better runtime performance side by side with higher load times. There are many reasons for this including the different compilation strategies in BC vs CS (Just In Time vs Ahead Of Time).

Are High Load Times a Problem?

In general, the emphasis on runtime performance seems to be a net win. But on the other hand, having high load times creates a disincentive to collaborate, since, for instance, even using a simple stack data structure implementation from another library proved to have prohibitive load-time overhead in the case of Qi. This does appear to have been symptomatic of a particular subclass of load times however -- those incurred when crossing the typed/untyped boundary -- and it may be that the problem is not more general than that.

Compiled Module Composition Scheme?

If the above results that:

Require latency is typically dominated by load time, and
Load time is typically dominated by number of files loaded rather than amount of code

are indeed accurate, then they suggest that optimizing this could provide significant gains.

One hypothetical option we discussed for this was the following "module composition" scheme to minimize the number of files loaded:

Given: A particular installation of Racket has N installed modules.
The compiler maintains a registry of previously compiled modules, both the individual N compiled modules, as well as "composed" modules that are each "equivalent" to some subset of the N modules. That is, (require ab) would be equivalent to (require a b). The manner in which these composed modules are generated is described below.
When a newly written module M is compiled, its module dependencies are enumerated by the compiler. It then finds the largest subset of these dependencies already in the registry. Then it repeats the process for the remainder of modules, and keeps repeating this until it has constructed a set of these registry modules that, together, would form a single module dependency for M (this set must always exist, assuming the N modules are individually compiled, to form the lowest level). It then "composes" these modules to produce a new module that is added to the registry. [Another possibility is to loosen the requirement so that instead of looking for a set of subsets that is equal to the desired set of modules, it looks for a set of subsets that "covers" the desired set of modules -- in this case, the compiler may end up including more code than is necessary in the produced binary, but that could be OK]
This ensures that, even though the number of possible subsets of installed modules is exponential, i.e. 2^N, that in practice, only the combinations that are needed would be constructed. We crudely estimated that this might end up creating quadratically many composed modules in practice instead of exponentially many.

The upshot of all this is that for any compiled module, there will always be exactly one module loaded at require time (per phase level, probably), no matter how many imports there actually are. On the other hand, it would lead to more space taken on disk by compiled modules, which would exhibit redundancy since the same contents would be present in many distinct compiled modules.

It may be that this approach could be applied for specific problematic cases, such as typed dependencies (but it would depend on whether the latency contribution from typed dependencies corresponds to the typical case observed, of being dominated by load time rather than instantiation time).

Loosening the guarantees on "Order of Effects" to be Lazier

Some of the slowness at load time is due to Racket's providing strong guarantees on the "order of effects" performed at module load time. Weakening these guarantees could provide another avenue for improvement by allowing the loading to be lazier [scribe's note: I don't know much about this - if you do, I'm happy to add more color here for the benefit of other readers].

Next Steps

We've convincingly established that require latency is not a concern as far as adding the bindingspec dependency to Qi, and aside from pathological cases, load times are an ecosystem issue that would need to be addressed at the VM level. So we can now return to the compiler work proper for next time. Some specific items:

Experiment with adding binding forms to Qi (prototype PR).
Investigate why datum matching doesn't match expander-produced core Qi in the initial compiler optimizations that we tried.
Write more compiler optimizations.

Attendees

Michael, Nia, Sid

Provide feedback

Saved searches

Use saved searches to filter your results more quickly