Skip to content

Qi Compiler Sync Aug 18 2023

Siddhartha Kasivajhula edited this page Aug 26, 2023 · 8 revisions

Harnessing Fusion

Qi Compiler Sync Aug 18 2023

Adjacent meetings: Previous | Up | Next

Summary

Dominik gave a demo of visualizing flows using flow graphs for the Actor Basic language. We resumed working on stream fusion and incorporating it into the compiler in a minimal way to prove that it works end-to-end, and discussed high level considerations regarding the structure of the compiler. So far, we have yet to attain the performance in practice that we saw in isolated benchmarks.

Background

Last time, we reached faster-than-Racket performance with stream fusion. This time we attempted to incorporate it into the Qi compiler to a minimal extent. Dominik independently has been working on visualizing flow graphs this week.

Visualizing Flows

Having graphical visualizations for flows has always been an interesting avenue for exploration, and Dominik has been working on just that for the Actor Basic language. It uses Graphviz under the hood and uses submodules to define the graph in a similar way to how scribble/srcdoc allows you to define docs alongside the code that are built in a separate phase. The Racket-Graphviz package came up as something that would simplify the implementation, but it is currently showing an error on the package index. It would be worthwhile to see what it would take to use such visualizations with Qi. We also discussed the possibility of using syntax-spec to add macros to Actor Basic.

Inspecting compiler output

Last time, we'd observed that the foldr version implementing filter-map was slower than a hand-coded recursion. We also observed that the continuation-passing version had the same performance as the hand-coded version. To get a clearer sense of the reasons, we looked at the intermediate stages of compilation by the Racket compiler. Specifically, the CP0 pass is the first one that happens after Chez Scheme expansion, showing the result of inlining and other Scheme-level optimizations. Using, for example, PLT_LINKLET_SHOW_CP0=1 raco make fold-test.rkt |less (where fold-test.rkt is a module containing the fold implementation), we were able to see that, even if we explicitly defined the built-in foldr implementation to be inlined, it was still slower than the hand-coded recursion due to the presence of error handling code and the need to allocate some data purely to preserve them in error contexts. We also saw that the compiled form of the continuation version was essentially identical to the hand-coded recursion, as expected.

Layout of the compiler

"Qi Normal Form"

Some forms like (~> (~> f g) h) and (~> (~> f) (~> g) h) are equivalent, so it seems important to have these in a standard form prior to attempting optimizations, similar to the "A-normal form" using in Scheme compilers. We called this "Qi Normal Form," and effectively, it amounts to ensuring that expressions are rewritten to a designated simplest form in terms of Qi identities, although in practice we may not actually recursively apply these identities as that may not be the most performant way to achieve the goal.

Order of optimization strategies

In order to optimize a Qi form, we'd like to recursively optimize every subexpression until it reaches a fixed point (i.e. depth-first), but at the same time also optimize the expression in terms of high level optimization rules (more like breadth-first). Sometimes, optimizing subexpressions may cause us to miss high level patterns that we could optimize even more effectively. One standard approach is to have two nested loops corresponding to these two optimization strategies. Either one of these could be the inner loop or the outer loop, and would represent different tradeoffs. We could potentially recur on an outer (high level) pass each time there is a change in any subexpression, but that would imply quadratic optimization time so it's likely infeasible. E-graphs were another approach discussed some time ago that solves this problem, but at the time we weren't sure if (a) there is a performant implementation of that that we could use, and (b) whether the "cost" of optimizations could be effectively defined, and (c) whether the optimization rules of Qi would be "swappable"/commutative enough to make E-graphs compelling, or whether the order of optimizations is likely to be more deterministic so that they could be hardcoded.

Performance goals

We'd like to have linear or Nlog(N) performance in implementing optimizations, and if there are parts that take quadratic or exponential time, those should be bounded to narrow fragments of programs.

What to Optimize

Qi's Core Values

Qi's core values-based routing forms such as amp / >< convert values to lists before operating on them, and then once again convert to values before returning them. We could optimize the internal list-based implementation using stream fusion but we would need to modify the map stream implementation to potentially yield multiple values before continuing, since each application of the function may produce any number of output values in Qi.

We decided to begin by optimizing the Racket list operations and return to this more complex version of stream fusion later.

Racket List Operations

To optimize Racket's list operations, we simply need to match literal uses of map and filter and so on, and rewrite them using (the continuation-passing version of) streams. We will be rewriting directly to Racket rather than Qi for this optimization, since this is an instance where we are explicitly choosing to optimize Racket code with respect to Qi invariants.

Harnessing Fusion

We added a very simple rule in the existing rudimentary layout of compiler passes to detect fusable list operations -- and just map or filter, for now -- and then rewrote that using stream fusion.

Performance for now is slightly better than Racket but not by as much as the 3x we observed last time.

Part of the reason for this, we guessed, was because it was using compose to compose the stream operations instead of directly invoking subsequent stages. We addressed that with this macro:

(define-syntax inline-compose1
  (syntax-rules ()
    [(_ f) f]
    [(_ f1 f ...) (f1 (inline-compose1 f ...))]))

We ran into some logistical difficulties with stale compilation and such, so we weren't able to verify performance after this change.

Outlook

The ideal overall layout of the compiler is as yet unclear, but we were able to minimally incorporate stream fusion and get it working end-to-end.

Next Steps

(Some of these are carried over from last time)

  • Understand why fusion performance in the compiler is slower than clean room performance, and attempt to recover some gains.
  • Implement streams for the other list operations described in St-Amour's writeup (i.e. unfoldr foldr map filter foldl zip average).
  • Agree on the high level layout of compiler passes and implement that structure.
  • Add additional nonlocal benchmarks to validate all cases optimized by stream fusion.
  • Add docs for syntax-spec and put it on the package index.
  • Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in try).

Attendees

Dominik, Jair, Michael, Sid, Stephen

Clone this wiki locally