Skip to content

Qi Compiler Sync Mar 2 2023

Siddhartha Kasivajhula edited this page Mar 11, 2023 · 4 revisions

Return to Baseline Performance

Qi Compiler Sync Mar 2 2023

Adjacent meetings: Previous | Up | Next

Summary

We reviewed the latest performance regression profile and made the necessary changes to bring performance back to pre-stratification levels, and also talked about a few other things.

Background

After the recent discovery of the loophole in Qi space, many forms were reimplemented to take advantage of it, so that forms that don't need to be part of the core language may nevertheless have custom, optimized Racket implementations. A few other forms were simply promoted back to being core forms. After these changes, the latest performance regression profile looked like this:

$ cd qi-sdk/profile
$ ./report.rkt -r path/to/before.json

'(("feedback" "1.51")
  ("relay" 1)
  ("not" 1)
  ("NAND" 1)
  ("switch" 1)
  ("bundle" 1)
  ("partition" 1)
  ("or%" 1)
  ("unless" 1)
  ("amp" 1)
  ("crossover" 1)
  ("ground" 1)
  ("count" 1)
  ("select" 1)
  ("none" 1)
  ("all?" 1)
  ("XOR" 1)
  ("inverter" 1)
  ("input aliases" 1)
  ("OR" 1)
  ("catchall-template" 1)
  ("gate" 1)
  ("gen" 1)
  ("loop2" 1)
  ("AND" 1)
  ("NOT" 1)
  ("XNOR" 1)
  ("when" 1)
  ("loop" 1)
  ("one-of?" 1)
  ("tee" 1)
  ("any?" 1)
  ("collect" 1)
  ("fanout" 1)
  (">>" 1)
  ("sieve" 1)
  ("apply" 1)
  ("template" 1)
  ("block" 1)
  ("<<" 1)
  ("group" 1)
  ("and%" 1)
  ("(require qi)" 1)
  ("rectify" 1)
  ("live?" 1)
  ("pass" 1)
  ("if" 1)
  ("and" 1)
  ("none?" 1)
  ("effect" 1)
  ("sep" 1)
  ("relay*" 1)
  ("try" 1)
  ("all" 1)
  ("esc" 1)
  ("clos" 1)
  ("or" 1)
  ("thread" 1)
  ("any" 1)
  ("thread-right" "0.66")
  ("currying" "0.54")
  ("NOR" "0.42"))

This showed that feedback was the only form that clearly performed worse than before.

Feedback

The main change in the implementation of feedback was from this:

#'(feedback-times (qi0->racket onex) n (qi0->racket thenex))

... to:

#'(lambda args
    (apply (feedback-times (qi0->racket onex) n (qi0->racket thenex))
           args))

... in order to delay the resolution of bindings until after compilation.

This involves an extra conversion of values to lists and then back to values again, but it is necessary as bindings would otherwise cause an error as being unresolved at compile time.

We looked at the implementation of feedback-times:

(define (power n f)
  (apply compose (make-list n f)))

(define (feedback-times f n then-f)
  (compose then-f (power n f)))

... and modified it to repeatedly call the function on inputs instead of composing the function beforehand:

(define (feedback-times f n then-f)
  (lambda args
    (if (= n 0)
        (apply then-f args)
        (call-with-values (lambda () (apply f args))
                          (feedback-times f (sub1 n) then-f)))))

... and this brought performance to parity with the original.

The Fly in the Loophole

In order to restore performance of partition, it was promoted to a core form. This didn't seem quite right though since naively it can be expressed as repeated applications of sieve, in an analogous way to how cond can be expressed as if, and indeed, only if is part of core Racket, and cond is expressed as a macro in terms of it. We considered whether the loophole might be more appropriate for it, or whether it should instead compile to sieve, with sieve having a more efficient implementation.

For the loophole, we considered implementing partition as a macro like this:

(define-qi-syntax-parser partition
  [(_:id)
   #'ground]
  [(_ [cond:clause body:clause])
   #'(~> (pass cond) body)]
  [(_ [cond:clause body:clause] ...+)
   #:with c+bs #`(list (cons #,(flow #'cond) #,(flow #'body)) ...)
   #'(#%blanket-template (partition-values c+bs __))])

Note that partition-values is a function, so that the above macro expands either to other Qi forms or to a Racket function in Qi space (i.e. the "loophole"). But after thinking about it, we concluded that the loophole was not appropriate here since although we gain optimized implementation without introducing the form into the core, we also lose the ability to optimize it further within the compiler. In general, it seems, the loophole is appropriate in cases where we are willing to say that we know nothing statically about the code that would allow us to optimize it.

Now, regarding compiling to sieve, on the face of it, we noted that variadic operations are not efficiently composable with one another, as there is a values -> list -> values conversion entailed in each composition. This is typically bad for cases where we are treating values as if they were elements of a list (which is the case in many core forms like loop and feedback), but that didn't appear to be the case with sieve, suggesting that sieve could potentially have an implementation that would be as efficient as the optimized partition. We didn't find an obvious such implementation, and concluded that while there were ways to improve the efficiency of sieve (such as potentially computing the filtered and excluded values in one step instead of repeating the filtration on the entire input), that the optimized partition would probably be faster.

So we left partition as a core form for now.

Core values or core lists?

A prominent bugbear in performance as far as Qi is concerned is the necessity to frequently convert between values and lists and back again. This is because variadic functions in Racket receive their arguments by packing them into a list. Then, in order to generate values again, the function must once again convert the result into values. This is a linear cost that could easily become quadratic if it is done recursively.

One possible way to avoid this could be to explore exclusively operating on lists internally in the implementation of Qi -- that is, core functions implementing the language would all operate in terms of lists rather than values. The values to list and list to values conversions would only occur at the boundary of the language with Racket. For this purpose a form like esc, or another variant esc2, could do the translation from values to lists and back again.

For that matter, based on static information known (such as number of arguments) even other data structures such as vectors may be used in some cases.

The design implications of this remain to be understood.

Shrinking the Core Language

Many forms are implemented as core forms at the moment because there isn't an efficient looping form in the core language that they could use as macros. If loop operated on lists, this would allow many forms such as all, any, and others, to be implemented as macros.

Other things

Weird error message

We discussed a weird error reported by syntax-spec that had been encountered during refactoring:

p.fspec: attribute contains non-syntax value

This was due to a bug and it was fixed.

Use of ~> in compiler optimizations?

We need to use thread rather than ~> in compiler optimizations at the moment as the former collides with the syntax-spec form of the same name. There are ways this could be circumvented but none of them are especially appealing at this time. It was noted but as it isn't user-facing it isn't a priority to address it.

Benchmarking for for/or and for/and

We reviewed some benchmarks for for/or and for/and in connection with their use in the implementation of none?, and concluded that they should perform very similarly as their expansions are very similar and they both short-circuit.

Next Steps

(Some of these are carried over from last time)

  • Prepare the form performance PR for review and merge it
  • Write non-local benchmarks to track the effect of compiler optimizations soon to be written.
  • Rebase the "first optimizations" branch and resume writing some initial compiler optimizations to have the basic layout of optimizations in place.
  • Add docs for syntax-spec and put it on the package index.
  • Review the initial design for bindings and see what other bindings cases we could support next.

Attendees

Michael, Sid

Clone this wiki locally