Skip to content

Qi Compiler Sync Jan 6 2023

Siddhartha Kasivajhula edited this page Jan 13, 2023 · 2 revisions

A Core Performance Issue Identified

Qi Compiler Sync Jan 6 2023

Adjacent meetings: Previous | Up | Next

Summary

We discussed some specific core forms of the language that are slow and identified a core issue at their heart.

Background

Before the compiler work was undertaken, Qi's implementation had a "flat" architecture, where expansion and compilation were conflated in the same macro. Separating out these two processes meant laying out the implementation in a stratified way, with a distinct expansion step producing core Qi expressions, and a compilation step producing Racket from core Qi. This allowed us to have a small core language. But without any optimizations present, it means that performance is currently worse than before, as formerly optimized local implementations of forms must now go through an unoptimized core language.

Slow Linguistic Forms

We did an analysis to measure performance regression of individual forms and came up with:

'(("live?" "55.71")
  ("count" "33.4")
  ("rectify" "29.76")
  ("none?" "14.48")
  ("all?" "12.97")
  ("AND" "11.77")
  ("any?" "10.79")
  ("OR" "9.69")
  ("any" "9.14")
  ("loop" "8.96")
  ("all" "7.56")
  ("partition" "7.5")
  ("none" "6.33")
  ("pass" "4.72")
  ("sieve" "4.51")
  ("inverter" "4.21")
  ("amp" "4.1")
  ("NAND" "4.09")
  ("or%" "1.86")
  ("select" "1.6")
  ("or" "1.59")
  ("and%" "1.57")
  ("not" "1.5")
  ("NOR" 1)
  ("NOT" 1)
  ("XNOR" 1)
  ("XOR" 1)
  ("and" 1)
  ("apply" 1)
  ("block" 1)
  ("bundle" 1)
  ("catchall-template" 1)
  ("clos" 1)
  ("collect" 1)
  ("crossover" 1)
  ("currying" 1)
  ("effect" 1)
  ("fanout" 1)
  ("feedback" 1)
  (">>" 1)
  ("<<" 1)
  ("gate" 1)
  ("gen" 1)
  ("group" 1)
  ("if" 1)
  ("input aliases" 1)
  ("loop2" 1)
  ("one-of?" 1)
  ("relay" 1)
  ("relay*" 1)
  ("sep" 1)
  ("switch" 1)
  ("tee" 1)
  ("template" 1)
  ("thread" 1)
  ("try" 1)
  ("unless" 1)
  ("when" 1)
  ("esc" "0.68")
  ("ground" "0.58")
  ("thread-right" "0.56"))

... where the numbers represent how much time the forms take in comparison to their original performance (1 implies no change).

The Issue

Investigating the slowest forms, we guessed that the broad pattern resulting in slowness was operating on values in "list-like" ways, e.g. mapping or folding over multiple values. We validated this in this way:

;; Testing the performance of working with large lists, by:
;;   Passing the list as a single argument and returning as a single value
;;   vs.
;;   Passing the list elements as variadic arguments and return values

(define (do-times n f)
  (let loop ([n n])
    (when (> n 0)
      (f)
      (loop (- n 1)))))

(define l (make-list 10000 'x))

(display "Passing a list as a single argument\n")
(time
  (letrec ([f (lambda (args) 5)])
    (do-times 10000 (lambda () (f l)))))

(display "Passing a list via variadic arguments\n")
(time
  (letrec ([f (lambda args 5)])
    (do-times 10000 (lambda () (apply f l)))))

(display "Returning a list as a single argument\n")
(time
  (letrec ([f (lambda () l)])
   (do-times 10000 (lambda () (f)))))

(display "Returning a list as variadic return values\n")
(time
  (letrec ([f (lambda () (apply values l))])
   (do-times 10000 (lambda () (call-with-values f list)))))

For which the output is:

Passing a list as a single argument
cpu time: 0 real time: 0 gc time: 0
Passing a list via variadic arguments
cpu time: 322 real time: 323 gc time: 2
Returning a list as a single argument
cpu time: 0 real time: 0 gc time: 0
Returning a list as variadic return values
cpu time: 536 real time: 553 gc time: 36

This supports that there is a linear cost incurred in passing a large number of values are arguments, as well as in returning a large number of values from a function.

Understanding the Issue

Multiple values in Racket are typically used to represent a fixed and small number of arguments. E.g.: "The first few values returned from a procedure call are placed in the same registers as the first few argument values passed to procedures, and the remainder are stored on the stack as for procedure calls" (Ashley and Dybvig 1994). As a result, passing a large number of arguments would not be able to leverage registers, but evidently also has other problems. TODO: elaborate.

Ways to Improve It in General

  1. Implement forms like loop using lists internally if possible, while still presenting a multiple values input and output interface.
  2. Consider implementing analogues of loop that present a list input and output interface, so that all "list-like" operations are actually done with lists rather than values.
  3. Think differently about implementations in terms of small vs large number of arguments. This could be either in cases where the number is known at compile-time (e.g. amp) or cases where we can obtain this information from an arity-inferring optimization pass.
  4. Modify the Chez Scheme multiple values implementation.

More Discussion

Keyword arguments in Racket and Chez Scheme are implemented in a hierarchical way, where there is a "fast level" where such arguments are supplied directly, and a "slow level" where they are supplied dynamically. This may be worth looking at as it could suggest analogous handling of small vs large number of values in Qi.

Some Specific Improvements

  1. In the implementation of amp, we know that the flow is being invoked with a single argument, so we could use this information instead of treating it as generic variadic application and using loop.
  2. fold-values incurs linear cost on each iteration (quadratic cost overall) due to passing around the accumulator accs as multiple values. Using a list instead could reduce it to linear cost overall.

Restoring Performance to Pre-Stratification Levels

We discussed the possibility of annotating expanded syntax with the a priori versions or some sort of tag during expansion so that the compiler could reliably produce an optimized version of that extended form. But we concluded that this would essentially be the same as expanding the core language, since the same information would be present in the expansion but in a different form (e.g. as a tag, or as syntax properties).

We therefore decided to move the slow non-core forms back to being core forms to fix the local performance regressions.

Next Steps

(Some of these are carried over from last time)

  • Add docs for syntax-spec and put it on the package index.
  • Merge the bindings PR.
  • Start a new PR to move the slow forms back into the core language.
  • Attempt some of the other improvements discussed above.
  • Review the initial design for bindings and see what other bindings cases we could support next.

Attendees

Alex, Hans, Michael, Sid

Clone this wiki locally