Qi Compiler Sync Aug 4 2023

A Chain Reaction for Stream Fusion

Qi Compiler Sync Aug 4 2023

Adjacent meetings: Previous | Up | Next

Summary

We explored the stream fusion compiler optimization for achieving "deforestation," i.e. combining multiple stages of functional transformations into a single stage (similar to Clojure's "transducers"). We also explored a "fold universality" approach as a baseline for comparison. In assessing these, we hand-optimized a simple filter → map sequence to each of these respective optimized versions and took stock. Among other things, we found that employing stream fusion gives rise to a "chain reaction" of further optimizations that would be necessary to achieve theoretical performance.

Background

Sequences of functional transformations are a high-level way to express data transformations without needing to specify at a granular level what happens to each element in the original list using some form of iteration. But such sequences end up building intermediate list representations that take up more memory and increase the processing time, so we'd like to be able to automatically optimize them to avoid the computational cost that comes with the higher level of abstraction. The survey paper by Vincent St-Amour is a great starting point to understand the motivation for this optimization as well as the many different approaches to implementing it.

The functional sequence we used for benchmarking this time was a simple filter → map sequence, whose implementation in Racket is:

(define (filter-map lst)
  (map sqr (filter odd? lst)))

The existing filter-map benchmark in the Qi SDK invokes this function (and, independently, the corresponding Qi version) a certain number of times with a list of size 10. We felt that with a small list like this, the benchmark result could be dominated by the cost of wrapping with lambdas and compose and other incidental details of the optimized implementation, so we added another benchmark identical to this one except that it uses a list of size 1000. This benchmark is called filter-map (large list). Both of these appear in the benchmarking results below.

Fold Universality

The universality of the fold operation allows it to be used to express other functions like map and filter. Together with the unfold operation which is used to generate sequences, a lot of functional pipelines could be expressed in a single step, and is the approach taken by Wadler in 1981 (per St-Amour).

As a baseline for benchmarking our optimization, we rewrote filter-map to this:

(define (filter-map lst)
  (foldr (λ (v vs)
           (if (odd? v)
               (cons (sqr v) vs)
               vs))
         null
         lst))

We replaced the "Qi" implementation of this benchmark with this so that we could use the existing benchmarking scripts to compare against Racket:

$ make profile-competitive

... and this showed the following results:

Performance relative to baseline:
...
 (filter-map (large list) 0.74)
 (filter-map 0.55))

So the optimized version is about 50% faster than the original.

This approach provides an upper bound on the performance gains that can be expected from optimizations, but the problem with actually doing it this way in the compiler is that if there are N functional components, there are N² possible optimization rules, making it potentially tedious to implement if N is large. Additionally, it seems that not all functional pipelines can be expressed this way [St-Amour].

Stream Fusion

This is the state-of-the-art approach described in St-Amour's survey paper, and is used in Haskell's GHC compiler. This approach involves employing a Stream datatype composed of a next function and a state, which is able to express arbitrary sequences by suitably tailoring the definitions of next and state. Sequences of functional transformations are represented as the composition of a sequence of streams, yielding a single composite stream (hence stream "fusion").

We implemented it this way, following the definitions in St-Amour:

Stream data type

(struct stream (next state)
  #:transparent)

(define (list->stream lst)
  (define (next state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (stream next lst))

(define (stream->list s)
  (match ((stream-next s) (stream-state s))
    ['done null]
    [(cons 'skip state)
     (stream->list (stream (stream-next s) state))]
    [(list 'yield value state)
     (cons value
           (stream->list (stream (stream-next s) state)))]))

Map and Filter streams

(define (map-stream f s)
  (define (next state)
    (match ((stream-next s) state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (list 'yield (f value) new-state)]))
  (stream next (stream-state s)))

(define (filter-stream f s)
  (define (next state)
    (match ((stream-next s) state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (if (f value)
           (list 'yield value new-state)
           (cons 'skip new-state))]))
  (stream next (stream-state s)))

And with these available, we implemented filter-map this way:

Filter-map

(define (filter-map lst)
  (let ([s (list->stream lst)])
    (stream->list (map-stream sqr (filter-stream odd? s)))))

Benchmarking showed:

Performance relative to baseline:
...
 (filter-map (large list) 1.81)
 (filter-map 1))

A Chain Reaction

The results with stream fusion showed that we were slower than Racket by about 80%. This was not unexpected, since stream fusion in the GHC Haskell compiler assumes the existence of a number of other compiler optimizations that allow it to be efficient, many of which are absent in Racket. We considered what it would take for the Qi compiler to be able to make stream fusion similarly efficient, and started to implement these optimizations by hand as successive stages of optimized code.

Inlining

A standard compiler optimization, this expands function definitions "inline" to avoid the cost of function invocation (e.g. involving use of the stack). This is what filter-map looked like with this optimization:

(define (filter-map lst)
  (define (next-list->stream state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (let ([s (stream next-list->stream lst)])
    (define (next-filter-stream state)
      (match ((stream-next s) state)
        ['done 'done]
        [(cons 'skip new-state) (cons 'skip new-state)]
        [(list 'yield value new-state)
         (if (odd? value)
             (list 'yield value new-state)
             (cons 'skip new-state))]))
    (let ([s (stream next-filter-stream (stream-state s))])
      (define (next-map-stream state)
        (match ((stream-next s) state)
          ['done 'done]
          [(cons 'skip new-state) (cons 'skip new-state)]
          [(list 'yield value new-state)
           (list 'yield (sqr value) new-state)]))
      (let ([s (stream next-map-stream (stream-state s))])
        (stream->list s)))))

Performance was unchanged from before:

Performance relative to baseline:
...
 (filter-map (large list) 1.86)
 (filter-map 1)

Partial Evaluation of Accessors and Constructors

Next, we identified occurrences of constructors followed by accessors, where we could do a transformation like this:

(let ([my-list (cons x y)]) ... (car my-list))
->
(let ([my-list (cons x y)]) ... x)

... which is an instance of what's called "partial evaluation," since we're essentially evaluating the source code in a selective way.

This gave us:

(define (filter-map lst)
  (define (next-list->stream state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (let ([s (stream next-list->stream lst)])
    (define (next-filter-stream state)
      (match (next-list->stream state)
        ['done 'done]
        [(cons 'skip new-state) (cons 'skip new-state)]
        [(list 'yield value new-state)
         (if (odd? value)
             (list 'yield value new-state)
             (cons 'skip new-state))]))
    (let ([s (stream next-filter-stream lst)])
      (define (next-map-stream state)
        (match (next-filter-stream state)
          ['done 'done]
          [(cons 'skip new-state) (cons 'skip new-state)]
          [(list 'yield value new-state)
           (list 'yield (sqr value) new-state)]))
      (let ([s (stream next-map-stream lst)])
        (stream->list s)))))

Dead Code Elimination

Now that we were essentially using primitive data directly without constructing the richer types forming the abstraction for the benefit of the programmer, there were many unused binding forms where such types were constructed. We eliminated these:

(define (filter-map lst)
  (define (next-list->stream state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (define (next-filter-stream state)
    (match (next-list->stream state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (if (odd? value)
           (list 'yield value new-state)
           (cons 'skip new-state))]))
  (define (next-map-stream state)
    (match (next-filter-stream state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (list 'yield (sqr value) new-state)]))
  (let ([s (stream next-map-stream lst)])
    (stream->list s)))

"Case-of-case"

Next, we tried inverting the order of conditional expressions (in our case, cond and match) which in some cases can be faster. This is called the "case-of-case" optimization in Haskell.

(define (filter-map lst)
  (define (next-list->stream state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (define (next-filter-stream state)
    (cond [(null? state) (match 'done
                           ['done 'done]
                           [(cons 'skip new-state) (cons 'skip new-state)]
                           [(list 'yield value new-state)
                            (if (odd? value)
                                (list 'yield value new-state)
                                (cons 'skip new-state))])]
          [else (match (list 'yield (car state) (cdr state))
                  ['done 'done]
                  [(cons 'skip new-state) (cons 'skip new-state)]
                  [(list 'yield value new-state)
                   (if (odd? value)
                       (list 'yield value new-state)
                       (cons 'skip new-state))])]))
  (define (next-map-stream state)
    (match (next-filter-stream state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (list 'yield (sqr value) new-state)]))
  (let ([s (stream next-map-stream lst)])
    (stream->list s)))

At some intermediate point, we ran the benchmarks again and discovered that things had gotten slower. I don't recall exactly when we did this, but it might have been somewhere around this point:

Performance relative to baseline:
...
 (filter-map (large list) 1.95)
 (filter-map 1.78)

... certainly, a case (of case) of things going in the wrong direction, but it's possible that things could get slower at intermediate stages before becoming faster by being able to take advantage of subsequent optimizations that would not have been possible prior to these transformations. For instance, a next step would be to partially evaluate the match form on known arguments, which could allow us to eliminate some branches or possibly even entire conditionals.

Playing to Racket's Strengths

Another question to consider in forming an optimization strategy is, what are some optimizations that Racket is especially good at? Could we rewrite code in such a way that it would trigger these more performant optimizations (which Racket would not ordinarily apply due to different invariants, as discussed below in Outlook)?

Outlook

Stream fusion as an optimization layered on top of a rich existing optimization stack is attractive due to its ability to represent arbitrary functional pipelines (thus avoiding the N² rules that would be necessary with fold universality) and its high performance. But without these optimizations being available in Racket, it gives rise to a "chain reaction" of optimizations that would need to be written in the Qi compiler, which feels like it would be overstepping the boundary between Qi and the host language (Racket).

For some of these optimizations, it's conceivable that they could be implemented in the Racket compiler, but on the other hand, Racket ensures a lot of language invariants that would make such optimizations backwards incompatible -- that is, such optimizations may break invariants in the presence of side effects, and that may not be acceptable from Racket's perspective. So, an argument could be made that it could be OK for the Qi compiler to do such optimizations, as they are in relation to invariants that Qi ensures, rather than invariants that Racket ensures. And in particular, a core tenet of Qi's "theory of optimization" is that we will make no guarantees about "incidental" side effects, and only ensure defined behavior when side effects are explicitly indicated either using the effect form, or by escaping to Racket via esc or gen.

On the other hand, with fold universality, the N² rules are only really a problem if we expect N to be large. For small N, it would be easier to write out these N² rules in the compiler than to implement the chain reaction of optimizations that would be necessary for stream fusion.

For now, we intend to continue trying out the tower of optimizations on specific examples to see if we can reach fold performance using stream fusion, mostly for entertainment and edification. But it appears as though using fold universality is a leading contender for the implementation of deforestation at this time.

Another intriguing possibility is that we may be able to find some combination of stream fusion and fold universality that would simultaneously:

Minimize the number of rules to be written by pre-unifying some of the different patterns using stream fusion, so that there would be fewer input patterns for fold universality to optimize.
Support more functional pipelines than fold universality does. St-Amour's survey gives the impression that stream fusion can optimize all cases, while fold universality can only optimize some. Could employing the former as an input to the latter allow us to retain this expressiveness without triggering the chain reaction?

Next Steps

(Some of these are carried over from last time)

Continue on the chain reaction of optimizations and attempt to reach baseline fold performance for filter-map.
Try hand optimizing a new nonlocal benchmark: sum (map square (upto 1 n)) (example from St-Amour's writeup).
Decide on the implementation.
Implement deforestation in the compiler.
Add docs for syntax-spec and put it on the package index.
Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in try).

Attendees

Jair, Michael, Sid

Provide feedback

Saved searches

Use saved searches to filter your results more quickly