
Setting a New Benchmark

Qi Compiler Sync Dec 29 2023

Adjacent meetings: Previous | Up | Next

Summary

We discussed a new benchmarking suite that Dominik is working on. We reviewed the recent fix for optimization of inappropriate positions in the syntax tree and noticed some compatibility issues with BC versions of Racket. We discussed priorities and commitments we could make as part of the release, and strategies for managing backwards-incompatible transitions.

Background

The compiler work is close to release, and it's a good time to benchmark performance to get a sense of where we stand in relation to baseline Racket. Separately, the recent issue involving inappropriate optimization of partial fragments of input syntax was fixed, but the PR was reporting some build failures:

Nonterminal blues

Investigating the build failures

We had originally intended to merge the "nonterminal" PR, but when we looked at it we found that CI was failing. Running the test job on a recent version of Racket CS showed that it was passing, so we tried different configurations of Racket versions and variants until we found that the build was failing specifically on older versions of Racket BC.

Forwards and backwards

What are our commitments to compatibility?

In general, Qi has a forward-looking policy on compatibility, and we have been willing to make backwards-incompatible changes for the benefit of future usability and practicality, while, of course, taking care to provide users a sane migration plan and help with migration as needed. The error in the failing builds was something that didn't immediately ring a bell:

write: cannot marshal value that is embedded in compiled code
  value: (srcloc #<path:/home/runner/work/qi/qi/qi-lib/flow/core/normalize.rkt> 74 7 2297 22)
  compilation context...:
   /home/runner/work/qi/qi/qi-test/tests/qi.rkt
  context...:
   /usr/share/racket/collects/compiler/private/cm-minimal.rkt:808:8

We considered whether it was a priority to investigate this issue, and we seemed to agree that, as BC is increasingly considered a legacy variant of Racket, it might be acceptable to offer only limited support for BC variants of Racket. In particular, the build still passes on BC >= 8.9 (and CS >= 8.5, as expected). [Note: in subsequent discussion, we felt that we should investigate this issue after all, as it is likely to be a simple fix, and as it might reveal a problem that our build and tests don't otherwise surface.]

Designing for backwards incompatibility

In any project, decisions get made with the limited information available at the time. Over time, the project is informed by many different perspectives, and sometimes it emerges that choices made earlier were not as good as they could have been. In such situations, we are faced with a new choice: should we change things to the newly-apparent "right way," retain them as they are, or find some compromise? Choosing the first path leads to backwards incompatibility. Different projects, for diverse reasons, have different attitudes toward such changes.

Racket is a language with a strong tradition of maintaining backwards compatibility, continuing to support code that was written even decades ago. At the same time, it has of course taken many bold and sometimes unpopular steps for the benefit of posterity, such as making the storied cons interface produce immutable cons cells (rather than the mutable ones in the Scheme standard, with the set-car! and set-cdr! mutators replaced by a separate mutable pair type), moving from the PLaneT package server to the current Racket Package Index and a very different package management philosophy, and, most recently, exploring infix (rather than Lisp's traditional S-expression, prefix-oriented) surface syntax in Rhombus. But backwards compatibility remains a very high priority.

Qi is a project at a much smaller scale and with much less historical baggage and expectations, so we are at liberty to have a stronger forward-looking emphasis. So far, we have considered backwards compatibility to be important, but secondary to doing the "right thing." But this should not come at the cost of user happiness, to put it as Matz might. Racket is an incredibly powerful and expressive language ecosystem, so it should be feasible to write and leverage tools in this ecosystem to minimize the logistical impact of backwards-incompatible changes, as well as to design a good user experience for migration on a social level, for any changes that are not automatable.

Taking the design of such a process seriously could empower us to be bold in making necessary changes, and even free us to make merely desirable changes that happen to be backwards-incompatible. It goes without saying that such changes should always be made with care. Sometimes, things that stand the test of time do so for a reason that isn't apparent to those who may seek to redo them in "the right way." But with the right mix of boldness and humility, we could achieve the best outcomes in the long term.

Aside from the specific version compatibility issue mentioned above, the upcoming release of Qi does include a few backwards-incompatible changes that are documented in the release notes. We've already taken steps to ensure that users have time and help with migrating their codebases to be ready for these incompatibilities, but we considered whether we could do something more. In particular, we considered whether it would be possible to use a parser of some kind to migrate codebases automatically, according to syntactic rules we specify. Resyntax, a powerful rule-based refactoring tool, could potentially help with this. Another tool that could provide some inspiration is C3PO, which was written to help users migrate contract specifications.

We felt that some of the current backwards incompatibilities are purely syntactic in nature, and could be migrated automatically using such a tool. But others, such as recognizing higher-order functions and explicitly escaping them, couldn't be handled by a purely syntactic transformation. Still, we agreed that offering such a tool would be a good design goal for the next Qi release (i.e. not the one planned in a few weeks but the following one).
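
To make the "purely syntactic" case concrete, here is a minimal sketch of the kind of rewrite such a tool could perform, written directly with syntax-parse rather than as an actual Resyntax rule. The form names old-form and new-form are hypothetical placeholders, not real Qi forms; a real migration tool would encode the specific changes documented in the release notes.

  #lang racket
  (require syntax/parse)

  ;; Rewrite one expression, recurring into subexpressions. Any occurrence of
  ;; the (hypothetical) old-form is replaced by the (hypothetical) new-form.
  (define (migrate stx)
    (syntax-parse stx
      #:datum-literals (old-form)
      [(old-form arg ...)
       (with-syntax ([(arg* ...) (map migrate (attribute arg))])
         #'(new-form arg* ...))]
      [(e ...)
       (with-syntax ([(e* ...) (map migrate (attribute e))])
         #'(e* ...))]
      [_ stx]))

  ;; Example: (migrate #'(old-form 1 (old-form 2 3))) => (new-form 1 (new-form 2 3))
  (syntax->datum (migrate #'(old-form 1 (old-form 2 3))))

A rewrite like this works only when the change can be decided from the code's shape; deciding whether an expression denotes a higher-order function, by contrast, requires semantic knowledge that a purely syntactic pass doesn't have.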

A New Benchmarking Suite

Our existing benchmarks in the Qi SDK are useful enough to give some idea of performance but are fairly rudimentary and also suffer from high variance in their results. They are also an internal tool and not readily accessible to users, and their output is simply text rather than charts, requiring some interpretation by the reader. We also have local benchmarks that are tracked across commits with publicly visible charts, but the data is unreliable for many reasons and useful mainly to signal obvious regressions.

We discussed a new suite of benchmarks that's in the works. We hope that it will address the above shortcomings and be rigorous and consistent, that is, give us a high degree of confidence in the results. We'd also want it to afford us the data and tools to investigate any anomalies in detail, to present the results clearly (e.g. not just as text output but as charts and other visualizations), and to do so in a way that is publicly visible and up to date.

Features

We saw a preview of some features that have already been implemented.

Three layers of analysis

The results are organized into three layers:

  1. An individual run
  2. Statistics over N runs (including a standard deviation "wake" around the data line)
  3. Analysis on those statistics to derive insights with 3σ confidence (see the sketch below)
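
As a rough illustration of the second and third layers, here is a minimal sketch, assuming the suite retains the raw per-run timings: it summarizes N runs at one input size and flags runs outside a 3σ band. This is only one plausible reading of "insights with 3σ confidence," not the suite's actual code.

  #lang racket
  (require math/statistics)

  ;; Summarize the timings (in ms) from N runs at one input size, and flag any
  ;; run that falls outside the 3-sigma band around the mean.
  (define (summarize timings)
    (define m (mean timings))
    (define s (stddev timings))
    (define outliers
      (filter (lambda (t) (> (abs (- t m)) (* 3 s))) timings))
    (list 'mean m 'stddev s 'outliers outliers))

  ;; Example: (summarize '(10.2 9.8 10.1 10.0 14.9))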

Specification vs profile

There is a distinction between a benchmark specification and a benchmark profile. The former is a list of programs to benchmark and the input to use in each case. The latter specifies things like the sizes of the input list (and the increments between them), the number of times to run each evaluation (e.g. 10 times), and the cadence on which to do garbage collection (if at all). In other words, the benchmark profile contains preferences affecting the entire report, while the specification defines the benchmarks to be run.
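
As a purely illustrative sketch (these structures and names are hypothetical, not the suite's actual API), the distinction might look something like this: the specification names each benchmark and how to build its input, while the profile carries the report-wide knobs.

  #lang racket

  ;; Hypothetical structures illustrating the separation of concerns.
  (struct benchmark-spec (name baseline-version qi-version make-input) #:transparent)
  (struct benchmark-profile (input-sizes runs-per-size gc-between-runs?) #:transparent)

  ;; A specification entry: what to benchmark and the input to use.
  (define filter-map-spec
    (benchmark-spec 'filter-map
                    (lambda (lst) (map add1 (filter odd? lst))) ; baseline Racket version
                    (lambda (lst) (map add1 (filter odd? lst))) ; placeholder for the Qi flow
                    (lambda (n) (range n))))

  ;; A profile: preferences that affect the entire report.
  (define default-profile
    (benchmark-profile '(1000 10000 100000) 10 #t))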

Convergence

The generated report computes the "convergence number" -- that is, the performance ratio against the baseline that the results converge to as the size of the input list increases.
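
One plausible way to state this, assuming the ratio is reported as a speedup factor over baseline Racket, with t_baseline(n) and t_Qi(n) denoting running times on an input list of size n:

  \text{convergence} \approx \lim_{n \to \infty} \frac{t_{\text{baseline}}(n)}{t_{\text{Qi}}(n)}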

Underlying Data and Analysis

The underlying benchmark data is retained so that, for example, a histogram of running times could be generated to help us reason about any observations, including anomalies.

Future Work

There are many features that could be added in the future, including local filesystem caching of results to avoid regenerating the full report each time a portion is recomputed.

Investigating and eliminating anomalies

In the initial stages of this analysis in recent days and weeks, we had noticed unexpected "steps" where all trend lines moved up or down, as well as occasional spikes. The steps have now been addressed by performing garbage collection before each measurement (so that it isn't triggered mid-benchmark, causing a slowdown) and by running things in independent OS threads to minimize interference, so the trend lines are now smoother. The spikes are caused by concurrent activity on the machine, and should disappear if the benchmarks are run on dedicated hardware that isn't being used for anything else.
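
A minimal sketch of the measurement discipline described above (the OS-thread isolation is not shown, and run-once is a hypothetical thunk standing in for one benchmark iteration): garbage is collected before each measurement, and timing uses current-inexact-milliseconds for sub-millisecond resolution.

  #lang racket

  ;; Time a single iteration, forcing a GC first so that collection pauses
  ;; don't land in the middle of the measured region.
  (define (measure-once run-once)
    (collect-garbage)
    (define start (current-inexact-milliseconds))
    (run-once)
    (- (current-inexact-milliseconds) start)) ; elapsed time in (fractional) ms

  ;; Collect n measurements for later statistical analysis.
  (define (measure-n run-once n)
    (for/list ([i (in-range n)])
      (measure-once run-once)))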

Why are the results different from our other benchmarks?

Although in most cases the results were roughly consistent with those produced by our existing benchmarks, there was at least one result -- the "long functional pipeline" benchmark -- that was quite different. The old benchmarking suite did not show a significant improvement with Qi, which was unexpected since we did see such improvement for shorter pipelines, and expected the improvement to be more pronounced for the longer pipeline. The new suite does indeed show this greater (and anticipated) improvement.

We explained the discrepancy as follows:

  1. The resolution of time-apply is milliseconds, so any sub-millisecond differences would incur large errors at scale. For instance, if A takes 0.5 ms and B takes 1.5 ms, these may both be measured as 1 ms, even though A is in fact 3x faster than B! In contrast, the new benchmarking suite uses current-inexact-milliseconds, which resolves down to fractions of a millisecond. The way in which differences on such a small scale could have a pronounced large-scale impact is perhaps analogous to how, in the field of cosmology, anisotropies have been observed in the temperature of the Cosmic Microwave Background -- that is, the leftover heat from the Big Bang appears to be slightly different in different directions, and this has been considered surprising. One explanation that has been put forward is that the cause is random quantum fluctuations in the moments after the Big Bang -- very small-scale differences in the early and tiny universe -- which have "inflated" to the large scale of the universe we now inhabit. A long-winded and unnecessarily complex analogy to be sure, but perhaps an interesting one!

It would seem that the shorter the duration of individual benchmark runs, the greater the error contributed by this phenomenon. It is unclear, though, whether the long-functional-pipeline benchmark should be disproportionately affected by it.

  2. The existing benchmarks conflate setup time and runtime. Thus, if a benchmark is dominated by setup time, then the results would show about even performance between two alternative implementations even if the runtime performance is dramatically different, assuming the setup time is the same for both. Of course, if the setup time also varies, then that further confuses the results.

This phenomenon would especially contribute to noisy results in cases where the benchmark is dominated by setup time, where the setup time varies across the benchmarked implementations (which probably does not affect us, as we are presumably passing the same constructed data to both implementations presenting the same functional interface), or both. It's unclear whether the long-functional-pipeline benchmark should be disproportionately affected here either, but perhaps the discrepancy is some combination of the two factors.

In any event, the new results are self-consistent and in line with what we would expect.

Why isn't the speedup linear?

Another concern was that if the two functional stages in filter-map resulted in a 3x speedup, then with four or more stages in the long-functional-pipeline, shouldn't we expect a 6x or greater speedup? That is, if the speedup is a result of avoiding intermediate representations, then we should be constructing one representation instead of N, and taking just one pass over the data instead of N passes. Presumably, the 3x speedup comes from the fact that there are "close to 3" (i.e. two) stages in the pipeline, and thus, the speedup factor should continue to be roughly N?

When we first noted this issue recently, it turned out that deforestation wasn't even being performed at the time, an issue that was fixed by propagating the recently-introduced "nonterminal" syntax property across compiler passes (see Nonterminal blues). But even with that fixed, and using the new benchmarking suite, the speedup we actually see after deforestation is much smaller than N.

The resolution we came to is that the speedup is indeed linear, but it is linear in relation to the time taken by the shortest representative pipeline for a given list size, not in relation to zero. So it's something like A + Nx, where A might be determined by the size of the input list. A may be (say) 2 for the list sizes we use in our benchmarks, and if x is the incremental speedup contributed by each stage in the pipeline (say 0.2), that gives us, for a pipeline of length 3, a speedup of 2 + 3*0.2 = 2.6x. For a pipeline of length 6, this would give us 2 + 6*0.2 = 3.2x. These are just made-up numbers to illustrate the point.
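
Restating that illustrative model compactly (with the same made-up numbers as above, where A is the baseline speedup for a given list size and x is the per-stage increment):

  \text{speedup}(N) \approx A + N x, \qquad A = 2,\ x = 0.2 \;\Rightarrow\; \text{speedup}(3) = 2.6, \quad \text{speedup}(6) = 3.2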

So the expectation of linear speedup is correct, but our first interpretation was incorrect.

Generalizing beyond Qi vs Racket

As the functionality offered by this benchmarking suite could be broadly useful for benchmarking purposes, we considered whether it would make sense to maintain it as a distinct library. We agreed that while that would be valuable, it would also take extra work to generalize it to that level (for instance, accepting diverse forms of input and not just individual numbers or lists). So for now, we felt it would make sense to add it to the Qi SDK, and to think about modularizing it down the line, at which point it could simply become a library used by the Qi SDK.

Showcasing the results

We also discussed where these results could be publicly hosted and how they could be kept up to date. As the generated results can be presented as a Scribble document, we felt that a good approach would be to add a new CI workflow that pushes the built HTML output from Scribble to a GitHub Pages site (similar to our existing performance benchmarking and backup documentation workflows). This would ensure that the published benchmarks always reflect the latest commit on the main branch.

Next Steps

(Some of these are carried over from last time)

  • Address the Racket BC compatibility issue
  • Incorporate the new benchmark suite into the SDK (including Makefile targets) and CI workflows
  • Choose representative benchmarks to use in the report
  • Update the main documentation to reflect the compiler work
  • Review whether we can continue optimizing a single syntax node at many levels (this is not a blocker)
  • Comprehensively review error messages in Qi
  • Add test cases for the backwards incompatibility on datum vs literal matching.
  • Investigate ways to extend the compiler (e.g. language composition or compiler macros) and use that to implement qi/list (deforesting racket/list) as a proof of concept.
  • Prototype and then propose map1 for addition to Racket.
  • Preserve the stack of source expressions through expansion in Syntax Spec.
  • Review the use of contracts in deforestation.
  • Improve expander tests to compare expressions literally, using a test for alpha-equivalence.
  • Try unpublished expansion event APIs to implement hierarchical reporting of expansion in the compiler.
  • Implement streams for the other list operations described in St-Amour's writeup (i.e. unfoldr, zip, average). Note: implementing average might benefit from having "multi-valued CPS" for deforestation.
  • Find a way to reliably detect list-like operations on values.
  • Rewrite these list-like operations to list-based functional pipelines, e.g. (~> (-< (>< f) (>< g)) ...)
  • Generalize the stream implementation to support multiple values at each step (in addition to zero or one) and recheck the performance.
  • Review other core Qi combinators besides ~> (e.g. -<, == and ><, maybe if and pass) and see whether it would make sense to extend fusion to them.
  • Should Qi provide a map utility (overriding the default one) that supports "0 or more" values (as was discussed in the RacketCon talk)?
  • Design a core continuation form(s?) for Qi, and show how try can be implemented in terms of it.
  • Describe the sequence of compilation for input expressions with the optimizing compiler against the reference compiler, showing cases in which these do not terminate.
  • Come up with a methodology to formalize and theoretically validate compiler behavior.
  • Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in try).

Attendees

Dominik, Sid, Stephen
