Qi Meeting Mar 8 2024

The Elusive Source Syntax

Qi Meeting Mar 8 2024

Adjacent meetings: Previous | Up | Next

Summary

We started to investigate ways to preserve source syntax through expansion in Syntax Spec, for the purposes of providing appropriate error messages. We also merged the integration of the new benchmarking suite (vlibench) with CI so that it will run benchmarks on every commit.

[Note that there are a couple of unaddressed TODOs in the notes below. If you have any context on them, please feel free to edit the post as needed.]

Background

Qi syntax is first expanded and then compiled to produce Racket code that is then expanded and compiled into the bytecode that is ultimately evaluated in the Chez Scheme runtime. If an error is encountered during Qi compilation, we would often like to implicate the source syntax, that is, the code originally entered by the user, rather than the target syntax produced by the compiler. But at this stage in the processing of the input syntax, that source syntax is no longer available. In the recent Qi 4 release that included the optimizing compiler, we had to implement a "de-expander" to reconstruct the source syntax from the target expression for use in error messages, but this is a fragile approach, and the source expression cannot always be reliably reconstructed.

Blame Game Redux

When an error occurs at runtime, the evaluator does not have access to the program syntax at all (TODO: is this right? E.g. with an unhandled error, does a stacktrace include syntax and srcloc information? If so, how does it get that?) and it is up to the runtime error handler – and the programmer – to explicitly provide this information via an appropriate error message.

We've talked before about how, whenever an error occurs during compilation, there are many options for whom to blame, for instance, in the case of a hosted language like Qi, (a) the surface syntax, (b) the macro, (c) the core language syntax, (d) the target language (Racket) syntax.

There is no general rule that can be applied, and which expression should be blamed is decided on a case-by-case basis.

Here are some examples that illustrate each case.

Examples of Each Blame Case

Bad Surface Syntax

(define-qi-syntax-rule (my-amp f)
  (>< f))

(~> (3) (my-amp add1 sqr))

Here, the invocation provides more arguments than my-amp accepts, so we have a misuse of my-amp. The surface syntax (my-amp add1 sqr) should be blamed.

Bad Macro

(define-qi-syntax-rule (my-amp f)
  (>< f g))

(~> (3) (my-amp sqr))

Here, the macro my-amp expands to an invalid use of the core >< form. The macro should be blamed.

Bad Core Syntax

Code generation of Racket from Core Qi is fulfilled by the qi0->racket macro, which, as part of its operation, delegates to helper functions that parse syntax for specific forms.

To illustrate, here is an excerpt showing the operation of qi0->racket and its delegation to the parser for amp. amp-form here is simply a syntax class that matches valid amp syntax like (>< sqr):

(define-syntax (qi0->racket stx)
  (syntax-parse ...
    ...
    [e:amp-form (amp-parser #'e)]
    ...))

(define (amp-parser stx)
  (syntax-parse stx
    [(_ onex:clause)
     #'(curry map-values (qi0->racket onex))]))

map-values is a Racket function, so its argument is expected to be a Racket expression. If we had mistakenly written it this way:

(define (amp-parser stx)
  (syntax-parse stx
    [(_ onex:clause)
     #'(curry map-values onex)]))

… then the argument to map-values could be uncompiled Qi syntax that the Racket expander would not understand. In this case, if we happen to try an expression like (>< (~> sqr add1)), the error might simply say "~>: undefined," as it would assume that onex, in this case (~> sqr add1), is Racket syntax, and that ~> should be either a function or a macro defined in the macro-defining module. There is no such identifier defined there, and so this would be the resulting error.

But this error alone would not be very helpful without context. Where is ~> undefined? Who tried to use that form? It would be better to blame the qi0->racket macro (which delegates to this function as part of fulfilling expansion) for generating invalid syntax here. As this macro is part of the compiler for the core language, this is an instance of blaming the DSL core language.

Bad Target Syntax

It's conceivable that the Qi compiler could end up producing an expression like this:

(lambda (x) (mac x))

If mac is a macro that happens to expect two arguments, then this would be a misuse of mac and we should blame this entire expression, which is an expression of the target language, Racket.

Of course, this may have resulted from an incorrectly constructed Qi expression where the user explicitly passes a single argument to the invocation of mac, but there is no way for Qi to know the expected syntax of mac as it is a Racket macro. As far as it is concerned, Qi produced correct Racket syntax here.

Preserving the Source Syntax

Regardless of who we choose to blame, we should at least have the ability to do so, in the form of a reference to the appropriate syntax.

For the purposes of the Qi compiler, we often desire to blame the source syntax, and would like to preserve a reference to it through expansion.

We got started on this this time around, but it presented some difficulties.

When we last discussed this, we had initially considered two alternatives: (1) preserving the entire sequence of the forms taken on by the syntax, in a data structure (e.g. a stack), or (2) preserving only the immediately preceding syntax object, so that the stack is implicit. But upon further reflection, we felt that either of these options could be expensive, memory wise, and so discussed a third option, (3) preserving and propagating only the original syntax.

Why is preserving the syntax sequence expensive?

Ordinarily, Racket does not retain the source syntax at all, so that once expansion is fulfilled, the preceding syntax object becomes dereferenced and can be garbage collected. Thus, the memory footprint comprises just the currently expanding expression.

By preserving the preceding syntax object (or the entire stack – the cost should be the same in these two cases), we now inflate the memory needs by a factor of N, if there are N syntax objects in the sequence.

TODO: each syntax object is a tree of syntax objects. Does this affect the memory needs so that the cost may be greater than linear? (Possibly: each expansion step produces a syntax object with K component syntax objects. Each of these when expanded does the same, so that the growth in number of syntax objects is exponential, yielding something like K^N syntax objects for N expansion steps).

Is this too high a cost?

We discussed whether there were tools we could use to validate the performance impact we expect, both in time as well as, especially, space. We considered OS tools or Racket native tools such as installing and querying a custodian. But we seemed to conclude that there were no good tools that we knew of, that could be used for measurement purposes, to provide the kind of precise data we are looking for. We do not know, at this time, how to measure memory impact.

Preserving just the original syntax

In any case, we felt that preserving just the original syntax would be a way to get around the (presumed high) memory cost.

The way to propagate the source syntax is via a hidden syntax property attached to the expansion (hidden so as to keep it abstracted behind appropriate APIs). In general, we want to propagate the property if it is present, and otherwise create it with the current (i.e. generating) expression as its value.

But we ran into some difficulties here, including:

we don't want to propagate the source syntax to certain positions (nested macros?), and determining when to propagate vs not may not be easy
if multiple expressions are produced in the expansion, then it would be expensive to traverse them and attach the property to each component [do we actually need to worry about this? would it be sufficient to only propagate to the top level production?]

One option we considered was that in trying to determine the blamable party, if the property is not defined on the syntax, then check the parent syntax, and repeat until we find the property. If the property is not present, then that's an error. But a problem with this is, we may not have a way to access the parent.

We later discovered that Typed Racket has had to tackle a similar problem, and has an approach that might be relevant here in the Source Syntax package.

Rigorous benchmarks on every commit

We reviewed (and after the meeting, merged) a PR to gather new benchmark data.

Now, the new vlibench suite is invoked to give us a rigorous performance report on every commit. Together with our benchmark trends report (which, to be fair, as Dominik quipped, benchmarks GitHub's infrastructure! Still, it is useful to detect changes and give a rough idea of trends. Down the line, it might be possible to normalize the benchmarks against a reference computation each time, which just might make the results invariant to GitHub's changing infrastructure), this puts us in a very good position for monitoring Qi performance over time.

Next Steps

(Some of these are carried over from last time)

Merge "docs arrears" PR containing documentation related to Qi 4, including effect locality, etc.
Bring Qi meeting notes up to date
Review language composition proposal and implement a proof of concept.
Decide on appropriate reference implementations to use for comparison in new benchmarks report and add them.
Deforest other racket/list APIs via qi/list
Review whether we can continue optimizing a single syntax node at many levels
Preserve the source syntax through expansion in Syntax Spec.

Attendees

Dominik, Michael, Sid

Provide feedback

Saved searches

Use saved searches to filter your results more quickly