
Qi Compiler Sync Aug 25 2023

Siddhartha Kasivajhula edited this page Sep 2, 2023 · 3 revisions

Smashing the Upper Bound for Fun and Profit


Adjacent meetings: Previous | Up | Next

Summary

We continued working on integrating stream fusion into the Qi compiler. We fixed some bugs in the code from last week which, it turned out, wasn't actually being invoked, explaining why we observed no performance difference. We also encountered another issue during testing where the performance gain was smaller than we were expecting, but that turned out to be a red herring as well, and we are now seeing performance with Qi that matches the "clean room" performance we achieved in isolation. Interestingly, it may even be slightly faster than the performance "upper bound" we'd identified.

Background

Last time, we believed we had incorporated stream fusion into the Qi compiler (specifically for list operations like map and filter), but found that performance appeared to be only slightly better, rather than the roughly 3x improvement we were expecting. In the interim, Ben also pointed out that there was something funny about the code. This time we investigated further.

Fixing the implementation

When a print statement we inserted at the point where the stream fusion optimization is implemented didn't show up in the output, we realized that the pattern wasn't actually being matched. The pattern we were matching was:

(define-syntax-class fusable-list-operation
  #:attributes (next)
  (pattern ((~literal map) f)
    #:attr next #'map-cstream-next)
  (pattern ((~literal filter) f)
    #:attr next #'filter-cstream-next))

... in the syntax-parse rule:

  [((~datum thread) f:fusable-list-operation ...+)

We printed the syntax object before parsing it to see why it wasn't matching, and found that there were additional tags added by the expander that we weren't accounting for in the match patterns:

(thread
 (#%partial-application
  ((#%host-expression filter) (#%host-expression odd?)))
 (#%partial-application
  ((#%host-expression map) (#%host-expression sqr))))

The patterns were also discarding the functions being mapped and filtered. We modified the match pattern accordingly:

(define-syntax-class fusable-list-operation
  #:attributes (f next)
  #:datum-literals (#%host-expression #%partial-application)
  (pattern (#%partial-application
            ((#%host-expression (~literal map))
             (#%host-expression f)))
    #:attr next #'map-cstream-next)
  (pattern (#%partial-application
            ((#%host-expression (~literal filter))
             (#%host-expression f)))
    #:attr next #'filter-cstream-next))

... and eventually updated the expansion to:

(define-syntax inline-compose1
  (syntax-rules ()
    [(_ f) f]
    [(_ [op f] rest ...) (op f (inline-compose1 rest ...))]))

(define (generate-fused-operation ops)
  (syntax-parse (reverse ops)
    [(op:fusable-list-operation ...)
     #'(esc (λ (lst)
              ((cstream->list
                (inline-compose1 [op.next op.f] ...
                                 list->cstream-next))
               lst)))]))

... and that fixed the issue. (Note: the version shown here is the final one we settled on, not the immediate one that fixed the issue; see below for more on chasing the "red herring".)
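To make the expansion concrete, here is a self-contained sketch (using plain, non-inlined definitions for illustration) of roughly what the fused code looks like for the (thread (filter odd?) (map sqr)) example above:

```racket
#lang racket

;; Plain (non-inlined) versions of the stream fusion building blocks,
;; reproduced here so the sketch runs on its own.
(define (cstream->list next)
  (λ (state)
    (let loop ([state state])
      ((next (λ () null)
             (λ (state) (loop state))
             (λ (value state) (cons value (loop state))))
       state))))

(define (list->cstream-next done skip yield)
  (λ (state)
    (cond [(null? state) (done)]
          [else (yield (car state) (cdr state))])))

(define (map-cstream-next f next)
  (λ (done skip yield)
    (next done skip (λ (value state) (yield (f value) state)))))

(define (filter-cstream-next f next)
  (λ (done skip yield)
    (next done skip (λ (value state)
                      (if (f value) (yield value state) (skip state))))))

;; Since generate-fused-operation reverses the operations, the fused
;; code for (thread (filter odd?) (map sqr)) is roughly:
(define fused
  (λ (lst)
    ((cstream->list
      (map-cstream-next (λ (x) (* x x))
                        (filter-cstream-next odd?
                                             list->cstream-next)))
     lst)))

(fused '(1 2 3 4 5)) ; → '(1 9 25)
```

Note that the stages compose outside-in: the consumer (cstream->list) drives the map stage, which pulls from the filter stage, which pulls from the list producer.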

A Speedy Salmon and a Red Herring

Although a newly added test for the filter-map functional pipeline was now passing, the benchmarks still showed a significant gap from the performance we had observed before. We tried a few different variations on currying in the stream fusion implementation:

A

(begin-encourage-inline
  (define (cstream->list next)
    (λ (state)
      (let loop ([state state])
        ((next (λ () null)
               (λ (state) (loop state))
               (λ (value state)
                 (cons value (loop state))))
         state))))

  (define (list->cstream-next done skip yield)
    (λ (state)
      (cond [(null? state) (done)]
            [else (yield (car state) (cdr state))])))

  (define (map-cstream-next f)
    (λ (next)
      (λ (done skip yield)
        (next done
              skip
              (λ (value state)
                (yield (f value) state))))))

  (define (filter-cstream-next f)
    (λ (next)
      (λ (done skip yield)
        (next done
              skip
              (λ (value state)
                (if (f value)
                    (yield value state)
                    (skip state))))))))

(define-syntax inline-compose1
  (syntax-rules ()
    [(_ f) f]
    [(_ [op f] rest ...) ((op f) (inline-compose1 rest ...))]))

B

(begin-encourage-inline
  (define (cstream->list next)
    (λ (state)
      (let loop ([state state])
        ((next (λ () null)
               (λ (state) (loop state))
               (λ (value state)
                 (cons value (loop state))))
         state))))

  (define (list->cstream-next done skip yield)
    (λ (state)
      (cond [(null? state) (done)]
            [else (yield (car state) (cdr state))])))

  (define (map-cstream-next f next)
    (λ (done skip yield)
      (next done
            skip
            (λ (value state)
              (yield (f value) state)))))

  (define (filter-cstream-next f next)
    (λ (done skip yield)
      (next done
            skip
            (λ (value state)
              (if (f value)
                  (yield value state)
                  (skip state)))))))

(define-syntax inline-compose1
  (syntax-rules ()
    [(_ f) f]
    [(_ [op f] rest ...) (op f (inline-compose1 rest ...))]))

Both manual currying (as in version A) and syntactic currying (i.e. via Racket's curried-define shorthand, (define ((f x) y) ...)) turned out to evade Racket's compiler optimizations, so the definitions weren't getting inlined. We were able to see this by using the PLT_LINKLET_SHOW_CP0 flag that we saw last time. So we finally settled on:

Final version

(begin-encourage-inline
  (define-inline (cstream->list next)
    (λ (state)
      (let loop ([state state])
        ((next (λ () null)
               (λ (state) (loop state))
               (λ (value state)
                 (cons value (loop state))))
         state))))

  (define-inline (list->cstream-next done skip yield)
    (λ (state)
      (cond [(null? state) (done)]
            [else (yield (car state) (cdr state))])))

  (define-inline (map-cstream-next f next)
    (λ (done skip yield)
      (next done
            skip
            (λ (value state)
              (yield (f value) state)))))

  (define-inline (filter-cstream-next f next)
    (λ (done skip yield)
      (next done
            skip
            (λ (value state)
              (if (f value)
                  (yield value state)
                  (skip state)))))))

(define-syntax inline-compose1
  (syntax-rules ()
    [(_ f) f]
    [(_ [op f] rest ...) (op f (inline-compose1 rest ...))]))

... which was getting inlined as we were expecting.

Initially, even this appeared to be underperforming in benchmarks, but after we spent a lot of time investigating the CP0 output and convincing ourselves that it was effectively identical to the handwritten "clean room" version, we re-ran the benchmarks and found that it was indeed performing identically to the clean room version. This time, a stale compilation was likely what had thrown the results off.

The final results, with stream fusion operating within the Qi compiler, were:

$ racket -y report-competitive.rkt -s "filter-map" -s "filter-map (large list)"

Running competitive benchmarks...

Running Racket benchmarks...
filter-map:
215 ms
filter-map (large list):
687 ms

Running Qi benchmarks...
filter-map:
45 ms
filter-map (large list):
255 ms

Performance relative to baseline:
((filter-map (large list) 0.37) (filter-map 0.21))

Smashing the Upper Bound

Although we hadn't noticed it before, repeated testing suggested that this final continuation-passing version of stream fusion may have been performing (slightly) better than the "true upper bound" we'd identified recently. The CP0 output for the two versions looks like this:

Fusion:

(lambda (lst_52)
  (letrec ([loop_83 (lambda (state_84)
                      (if (#2%null? state_84)
                          '()
                          (let ([app_26 (#2%car
                                         state_84)])
                            (let ([state_67 (#3%cdr
                                             state_84)])
                              (if (#2%odd? app_26)
                                  (let ([value_86 (#3%*
                                                   app_26
                                                   app_26)])
                                    (#2%cons
                                     value_86
                                     (loop_83
                                      state_67)))
                                  (loop_83
                                   state_67))))))])
    (loop_83 lst_52)))

"True upper bound" (hand-coded recursion):

[filter-map-manual (lambda (lst_51)
                                (if (#2%null? lst_51)
                                    '()
                                    (let ([v_52 (#2%car lst_51)])
                                      (if (#2%odd? v_52)
                                          (let ([app_24 (#3%* v_52 v_52)])
                                            (#2%cons
                                              app_24
                                              (filter-map-manual
                                                (#3%cdr lst_51))))
                                          (filter-map-manual
                                            (#3%cdr lst_51))))))]

We considered two possible reasons this could be true (if indeed it were true):

  1. filter-map-manual is a module-level definition and that could make it harder to optimize due to the possibility of set!'ing the identifier elsewhere. But Racket handles this kind of thing and this is unlikely to be the reason.
  2. cdr is used only once but occurs twice in the latter's expansion, which in general could result in larger code size, but in this case shouldn't really cause a difference either.

So, maybe it's nothing. We also discovered that in otherwise well-controlled benchmarking, whichever implementation ran first consistently performed worse than the implementations benchmarked after it. There appears to be some "warm-up" cost, and that, rather than any inherent difference between the implementations, may be the reason for the difference we saw.
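One simple way to control for this warm-up effect (a sketch, not the benchmark harness we actually use) is to run each thunk once, untimed, before measuring it:

```racket
#lang racket

;; Sketch of warm-up-aware timing; not the actual Qi benchmark harness.
(define (measure thunk)
  (thunk)                                  ; warm-up run, result discarded
  (define-values (results cpu real gc)
    (time-apply thunk '()))                ; timed run
  cpu)                                     ; CPU time in milliseconds

(define lst (range 100000))
(measure (λ () (map (λ (x) (* x x)) (filter odd? lst))))
```

Running every candidate through the same wrapper means no implementation is penalized simply for being benchmarked first.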

Various Notes

begin-encourage-inline vs define-inline

The former tags the expanded definitions to tell the compiler to inline them at the compilation stage, but for various reasons, that's not guaranteed to happen. define-inline performs the inlining at the expansion stage.
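Both forms come from racket/performance-hint; a minimal illustration:

```racket
#lang racket
(require racket/performance-hint)

;; A hint: the compiler is encouraged, but not required, to inline
;; square at the compilation stage.
(begin-encourage-inline
  (define (square x) (* x x)))

;; Inlined at the expansion stage: uses of square* are rewritten to the
;; body by the macro expander itself.
(define-inline (square* x) (* x x))

(square 3)   ; → 9
(square* 3)  ; → 9
```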

Schemified vs CP0

In the compiler output that we see from using flags like PLT_LINKLET_SHOW_CP0, "schemified" shows the output of Racket's schemify pass, which is the input to Chez Scheme.

CP0 refers to the first pass of the Chez Scheme compiler, which happens right after Chez Scheme expansion and includes Scheme-level optimizations like inlining.

racket -y

racket -y, a recent addition to Racket, executes a module, recompiling it if anything in its dependency tree is stale. We did use this while running the benchmarks, so it's not clear that stale compilation can really account for why we chased the red herring regarding performance above, but that's still our best guess as to what happened there.

Other Possible Optimization Approaches

Implement primitive operations on lists

Qi primitives are implemented as Values → Values, but that's not efficient for cases where the number of values is arbitrary or unknown. So in general, it may make sense to make these operations List → List instead, and convert to and from lists on either end at the interface macro level (e.g. in the ~> Racket macro). This could also be done in cases where there is a host (Racket) expression embedded inside a Qi expression. Note that this approach is orthogonal to stream fusion and is just about operating on lists in the Qi core rather than values.
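As a hypothetical illustration (these names are not actual Qi internals), the interface could convert incoming values to a list, run a List → List core flow, and convert back to values on the way out:

```racket
#lang racket

;; Hypothetical sketch, not actual Qi internals: the core flow operates
;; on a list, and the interface converts between values and a list at
;; either end.
(define (call-with-list-core core-flow . vs)
  (apply values (core-flow vs)))

;; A values-based pipeline backed by a list-based core:
(call-with-list-core (λ (lst) (filter odd? lst)) 1 2 3 4 5)
```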

Next week's meeting

We're moving next week's meeting two hours later, i.e. to 7pm UTC instead of 5pm UTC, since Dominik is attending the quarterly meetup of the 2600 club.

Next Steps

(Some of these are carried over from last time)

  • Implement streams for the other list operations described in St-Amour's writeup (i.e. unfoldr, foldr, map, filter, foldl, zip, average).
  • Agree on the high level layout of compiler passes and implement that structure.
  • Add additional nonlocal benchmarks to validate all cases optimized by stream fusion.
  • Generalize stream fusion to other core Qi combinators in addition to ~> for list-based operations.
  • Apply stream fusion to values-based functional pipelines, like (~> (-< (>< f) (>< g)) ...)
  • Add docs for syntax-spec and put it on the package index.
  • Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in try).

Attendees

Dominik, Michael, Sid
