
RFC: design of ISLE instruction-selector DSL. #15

Merged Nov 15, 2021 (4 commits)

Conversation

@cfallin (Member) commented Aug 19, 2021

Rendered

This RFC proposes a new DSL for the instruction selectors in Cranelift,
automating the generation of the backends based on a list of term-rewriting
rules provided by the backend author.

This is based on extensive discussions surrounding the ideas in
#13 and is meant to be a refinement of earlier
ideas that is compatible with our goals related to verification, and is
able to build on top of Peepmatic's automaton abstraction.

cfallin and others added 3 commits August 17, 2021
@cfallin (Member, Author) commented Aug 19, 2021

cc folks who participated in #13 and/or may be interested due to involvement in Cranelift backends: @fitzgen @sunfishcode @tschneidereit @abrown @jlb6740 @akirilov-arm @afonso360 @bjorn3

@sparker-arm (Contributor) left a comment

Looks nice and clean.

Sorry if I've missed it, but my main confusion is over how extractor functions are defined: is the nitty-gritty matching done in Rust, and will it just reuse the existing code?

Also, would we be able to express instructions that have multiple outputs, such as pre/post-indexed memory operations?

> The basic strategy is to "inline" the rule-matching that would apply
> to the newly constructed value on the right-hand side of a rule. In
> other words, if a rule transforms `(A x y)` to `(B x y)` and there is
> another rule that matches on `(B x y)`, we generate code that chains
Contributor:

So does this mean we can write generic, and machine-level, pattern reducers as part of isel then? This suggests that 'B' could be a machine node, and so we can use machine nodes as inputs. If so, could this be viewed as multiple Peepmatic passes: (1) IR-to-IR reductions, (2) IR-to-MachInst ISel, (3) MachInst reductions? And would having multiple passes have benefits over inlining while the tree is still being selected? Maybe splitting it into the two later phases would ease implementation and verification. And could we do some lowering with the same method -- especially stuff like expanding unsupported SIMD operations?

Member Author:

Yes, the DSL itself is agnostic to whatever the input and output formats are (that depends on the "bindings" one writes with external extractor and constructor definitions), so in principle one could have all of the above.

Reductions in the legalization sense, e.g. lowering a 64-bit add to two 32-bit adds (to take a simple example), should be possible as well. This has implications for the term-stratification approach to ensuring termination, so we would need to expand that a bit (the RFC notes options here: allowing substructural recursion, or maybe just "recurse during inlining up to limit k") to allow the lowering code to be generated properly. But that's absolutely a goal -- writing fallback rules like this and then composing them with machine-specific rules is much more powerful than the support-all-features-on-all-machines approach we currently have (e.g. adding i128 separately on x64 and aarch64).
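
To make that concrete, here is a purely illustrative sketch of such a fallback rule in the RFC's rule syntax; every term name (`I64Value`, `Lo32`, `Add32`, `Adc32`, `ConcatRegs`) is invented for this example and is not part of the proposal:

```plain
;; Hypothetical legalization-style fallback rule: narrow a 64-bit add
;; into a 32-bit add of the low halves (which sets the carry flag)
;; followed by an add-with-carry of the high halves.
(rule (Lower (Iadd (I64Value a) (I64Value b)))
      (ConcatRegs
        (Add32 (Lo32 a) (Lo32 b))    ;; low half; sets carry
        (Adc32 (Hi32 a) (Hi32 b))))  ;; high half; consumes carry
```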

We could also have multiple passes; @fitzgen and I discussed this a bit yesterday actually. For now I think it makes sense to start with the single lowering pass, because we have the existing machinery for that, but once the DSL compiler exists, it should be relatively straightforward to use it to generate another pass as well.

> denotational value of a term is the symbolic or concrete value
> produced by the instruction it represents (depending on the
> interpretation); so "all" we have to do is to write, e.g.,
> pre/post-conditions for some SMT-solver or theorem-prover that
Contributor:

Sounds cool -- do you know if anyone else is doing this? It sounds like it would be a hard problem for instructions that read/write global state.

Member Author:

There are various research folks working on compiler verification; see e.g. this paper from ASPLOS 2021. The general term is "translation validation" when it involves proving that a specific input and output are equivalent, I think. I don't see the core Cranelift team diving into this full-time, but enabling it is very important, and there are academics who might be interested!

@sparker-arm (Contributor) commented Aug 20, 2021

Thanks for the paper - interesting stuff! From my brief interactions with Cranelift so far, the key difference I see between Cranelift and LLVM is that neither level of IR is as well defined in Cranelift. For example:

  1. I don't think that the aarch64 backend actually models pstate flag setting.
  2. The CLIF icmp has some ugly target-defined behaviour...

So I think having some verification would be an excellent driver for improving our internal semantics.

@bjorn3 (Contributor) commented Aug 19, 2021

One question I have before reading this RFC: would it be possible to have backend-agnostic rules for things like 128-bit ints, or 64-bit ints on 32-bit systems? Kind of like the old legalization system, except without having to edit the CLIF IR in place. As different architectures work differently, it may make sense to have partially or fully different rule sets for architectures with and without flags. In addition, it should be possible to override rules for specific backends if they can do it more efficiently.

> rules differently: for example, we might test/fuzz a backend by
> applying rules in a different order. One should also be free to tweak
> priorities to attempt to optimize, without fear of breaking the
> semantics.
Contributor:

It might make sense to detect when multiple rules can apply to the same input, and maybe allow hints to acknowledge the overlap. These hints could then also contain the advisory priority of the rules.

Member Author:

Yes, this gives me an idea for a DSL-compiler mode that produces some "explainer" output -- which overlaps occurred and which prioritization decisions were made. That could possibly be useful. Also if a rule is never used (because it's overshadowed by some other higher-priority rule that completely overlaps it) that is probably worth a warning.

```plain
(rule (Iadd a b)
      (X86Add a b))
```
Contributor:

A suggestion for an alternative syntax to consider:

```plain
(Iadd a b) => (X86Add a b);
```

Member Author:

I'm generally hoping to stay true to S-expression form here (so the toplevel forms/clauses are `(KEYWORD ...)`), but `=>` as the keyword (`(=> (A x) (B x))`), which is what Peepmatic does, could work. Others, please let us know your favorite bikeshed color below!

Contributor:

One more suggestion: please enforce a specific formatting of the S-expressions on CI, in rustfmt style -- that is, a 1-to-1 map from parsed AST to the resulting string. Not like e.g. clang-format, which tries to preserve newlines and the like and only normalizes things like spaces.

```plain
;; 12-bit immediate form.
(rule (iconst val)
      (when (is-aarch64-imm12-value val)) ;; Conditional predicate
```
Contributor:

Might be nice to infer this using a "typechecker" that takes constraints from the output instructions and propagates them to the input.

Member Author:

That feels a bit too magic/implicit, but what can be done instead is that constructors that build the final instructions take strongly-typed values that can only be produced by extractors that check the relevant conditions. So the plan is to have e.g. an `Aarch64Imm12` value type that can only represent legal values. Similar type-encoded conditions can be used for other properties.
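
As a sketch of the type-encoded-condition idea: the type name follows the comment above, and the encoding logic is illustrative, not the actual Cranelift definition.

```rust
/// A 12-bit immediate, optionally shifted left by 12, as in AArch64's
/// ADD/SUB (immediate) encodings. The only way to obtain one is via the
/// checked constructor, so any `Aarch64Imm12` that exists is encodable.
#[derive(Clone, Copy, Debug)]
pub struct Aarch64Imm12 {
    bits: u16,     // the low 12 bits of the logical value
    shift12: bool, // if true, the logical value is `bits << 12`
}

impl Aarch64Imm12 {
    /// Returns `Some` only if `val` fits one of the two legal forms.
    pub fn maybe_from_u64(val: u64) -> Option<Self> {
        if val < 0x1000 {
            Some(Self { bits: val as u16, shift12: false })
        } else if (val & !0xfff_000) == 0 {
            Some(Self { bits: (val >> 12) as u16, shift12: true })
        } else {
            None
        }
    }
}
```

An extractor would call `maybe_from_u64` and hand the DSL an `Aarch64Imm12`; the instruction constructor then accepts only that type, so unencodable values cannot reach emission.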

@afonso360 (Contributor) commented Aug 20, 2021

We may also want to use this for things that are not typecheckable, such as a processor extension predicate.

```plain
(rule (prio 20) (iadd (i8x16 a) (i8x16 b))
      (when (has-avx512))
      (avx512-add a b))

(rule (prio 10) (iadd (i8x16 a) (i8x16 b))
      (when (has-3dnow))
      (3dnow-add a b))
```

Or would these cases need a custom extractor to match them, instead of a generic `i8x16`?

@cfallin (Member, Author) commented Aug 19, 2021

> Sorry if I've missed it, but my main confusion is over how extractor functions are defined: is the nitty-gritty matching done in Rust, and will it just reuse the existing code?

Extractors are defined in Rust, yeah, so it's sort of an FFI to "runtime library code" (from the point of view of the DSL). But they should be low-level: I expect that we'll have extractors for the LowerCtx API, e.g. fetching instructions that feed into operands, fetching immediates, and that sort of thing, plus machine-specific things like the various immediate formats (imm12, immlogic), plus maybe other escape hatches to general logic where we really need it.
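
For a rough picture of what that FFI could look like -- all names and signatures below are placeholders, not the final ISLE/Cranelift binding -- the generated matcher could call backend-implemented methods that return `Option`, where `None` means "this pattern does not apply":

```rust
// Illustrative placeholders only, not the shipped API.
#[derive(Clone, Copy)]
struct Input(u32); // opaque handle to an instruction input

trait LowerContext {
    /// Backs a hypothetical `(InputConst k)` extractor: succeeds, yielding
    /// the constant value, only when `input` is produced by a constant.
    fn input_const(&mut self, input: Input) -> Option<u64>;

    /// Backs a hypothetical `(Imm12Value v)` extractor: succeeds only when
    /// the value fits AArch64's 12-bit immediate form.
    fn imm12_value(&mut self, val: u64) -> Option<u16>;
}
```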

> Also, would we be able to express instructions that have multiple outputs, such as pre/post-indexed memory operations?

This is an interesting one! In the most general form it needs multi-rooted DAG matching (i.e. we need to find the load/store and the add-to-pointer separately), which needs adjustments to our lowering loop as well. I don't see why a term-rewriting system in general couldn't express that, but it needs further thought.

@cfallin (Member, Author) commented Aug 19, 2021

> One question I have before reading this RFC: would it be possible to have backend-agnostic rules for things like 128-bit ints, or 64-bit ints on 32-bit systems? Kind of like the old legalization system, except without having to edit the CLIF IR in place. As different architectures work differently, it may make sense to have partially or fully different rule sets for architectures with and without flags. In addition, it should be possible to override rules for specific backends if they can do it more efficiently.

Indeed! The hope is that we can get back to that universe where we have fallback rules. In particular this should make e.g. fuller i128 support across all platforms, and also the eventual buildout of our 32-bit platform support (if someone is sufficiently motivated to drive it), much easier.

@afonso360 (Contributor) left a comment

I really like this! I think that this would greatly simplify backends (at least the aarch64 backend).

I'm slightly concerned about compile times. Working on LLVM, editing a .td file in one of the targets is not a great experience: it usually involves at least a one-minute build on my laptop, which is really disruptive. And while I now have a much better experience with a different machine, I would hate for us to find ourselves in the same situation.

I don't know if this is going to be an issue, but we may want to think about ways to mitigate it if it does.

> Note that a constructor, unlike an extractor, cannot fail. More on the
> execution model below.
>
> ## Defining Rules
Contributor:

In the aarch64 backend we have `emit_shl_i128`, which emits 9 instructions and is used 4 times while lowering other instructions. In the `rotr`/`rotl` case we end up emitting 21 instructions (1 i128 `shr` + 1 i128 `shl` + 3 instructions); it would be nice not to have to repeat the `shl`/`shr` instructions 4 times in different orders.

Would we do this via a `(shl-i128 a b)` constructor? Or can we do this as a separate rule which gets matched as well?

This feels like an edge case, not sure if it happens elsewhere.

Contributor:

A similar, but somewhat different, case here would be `popcnt`, where we emit a number of instructions and then match a single instruction depending on the type of the input.

But here, since the total is about 6 instructions across 5 different input types, it isn't as bad.

Contributor:

Re-reading the RFC now, it looks like the shl case would be a good use of Virtual Internal Terms. popcnt might be too small for this to make sense.

Member Author:

Yes, exactly, that's the intent of the ability to freely (in terms of runtime cost) create these virtual terms that chain rules together -- they allow additional factoring and abstraction.
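
A hypothetical sketch of the `shl-i128` case from above as a virtual internal term (names and signatures invented for illustration; the expansion body is elided):

```plain
;; `shl-i128` is neither a CLIF opcode nor a machine instruction; it is
;; an internal term that exists only to name a shared expansion.
(decl shl-i128 (ValueRegs Reg) ValueRegs)

;; The 9-instruction expansion is written exactly once, here.
(rule (shl-i128 val amt)
      ...)

;; Other rules simply construct the term; the DSL compiler inlines the
;; matching rule, so the factoring costs nothing at runtime.
(rule (Lower (Ishl (I128Value val) amt))
      (shl-i128 val amt))
```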

@cfallin (Member, Author) commented Aug 20, 2021

> I'm slightly concerned about compile times. Working on LLVM, editing a .td file in one of the targets is not a great experience: it usually involves at least a one-minute build on my laptop, which is really disruptive. And while I now have a much better experience with a different machine, I would hate for us to find ourselves in the same situation.
>
> I don't know if this is going to be an issue, but we may want to think about ways to mitigate it if it does.

I agree this could be a problem -- we'll have to see what it looks like when we have a prototype ready (it's coming soon-ish!).

There are really three aspects to this, and we could make a difference to the first two with a certain change:

  • Compile time of the DSL metacompiler. This depends in part on what we pull in (e.g., Peepmatic uses Z3 for typechecking, which takes a long time to build; my current prototype does not; there are other dimensions in which we could depend on an external crate or do something simpler internally) and on the complexity of the metacompilation in general.
  • Compile time of the DSL, to generate Rust code (i.e., the tool invocation itself). We can be careful not to have anything too fancy in the compilation (e.g. inefficient algorithms when combining rules or building automata) -- the work on Peepmatic should at least give us a really good start here.
  • Compile time of the code generated by the DSL compiler. This is impacted by our DSL compilation strategy: if we generate a really big Rust function with all the matching cases inlined and significant duplication, etc., it could be really bad; conversely if we generate carefully factored Rust code it may not be bad at all. This cost is paid whenever compiling Cranelift, even if not modifying the DSL.

At least the first two (building the metacompiler, and invoking the metacompiler to generate Rust code from the DSL) could be removed from the Cranelift build time when the DSL code is unchanged by checking the generated Rust code into source control. I'll note that this is not the idiomatic Rust approach at all (that would be to run this from build.rs and put the output in target/); but it's an option, and does have some other attractive properties, e.g. making the generated code much easier to find for anyone who needs to follow or debug it. On the other hand, it adds potential for surprising errors when it gets out-of-sync.
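
For contrast, the idiomatic build.rs arrangement would look roughly like the sketch below; `isle_compiler::compile_file` is a placeholder for whatever entry point the metacompiler ends up exposing, not an existing API.

```rust
// build.rs -- hypothetical sketch of the idiomatic (non-checked-in) setup:
// regenerate the Rust source into OUT_DIR (under target/) on every build.
use std::{env, fs, path::PathBuf};

fn main() {
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    // Placeholder API: compile one ISLE file to a String of Rust source.
    let rust_src = isle_compiler::compile_file("isa/x64/lower.isle").unwrap();
    fs::write(out_dir.join("lower_generated.rs"), rust_src).unwrap();
    // Rebuild only when the DSL input changes.
    println!("cargo:rerun-if-changed=isa/x64/lower.isle");
}
```

Checking the generated code into the repo instead would replace this step with a manual regeneration command plus a CI check that the committed file is in sync.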

The edit-run debug loop depends on the last two only (presumably the DSL metacompiler will be built once then won't be changed if one is hacking on a backend); we get all three only when hacking on the metacompiler itself.

Anyway, whether or not to check in the generated artifacts is a question we can discuss here as well -- thoughts?

```plain
(AddConst (ValueReg b) c))

; Add with load on one input.
; Note the `FromInstOnlyUse` which ensures we match only if
```
Member:

It seems this `FromInstOnlyUse` has disappeared below?

```plain
; Add with constant on one input. `I32Value` extractor matches a `u64`
; immediate that fits in an `i32` (signed 32-bit immediate field).
(rule (Lower (Iadd (InsnValue (I32Value (Iconst c))) b))
      (AddConst (ValueReg b) c))
```
Member:

From looking at this, I'm thinking that constant propagation / strength reduction (maybe other optimizations) could be carried out within this framework as well, but those would operate in higher-level IR types (CLIF, not VCode) that aren't tied to a specific architecture. If I understand correctly, one could have two sets of declared patterns (i.e. in two files), each declaring its IR type, the set of rules, etc. and we could call the two resulting lower functions at different times during the optimization passes.

```plain
(type Inst extern (enum
  (Add (a Input) (b Input))
  (Load (addr Input))
  (Const (val u64))
```
Member:

It's going to be harder to inspect some characteristics of the resulting Rust enum: what is its size? Which variant is the biggest (and thus determines the whole enum's size)? Reducing the memory footprint of IR data structures has been a fruitful exercise in the past, causing reductions in the number of malloc calls and thus compile-time speedups, and I'm a bit worried that this kind of optimization opportunity, which was very easy to act on in the past, will become out of reach within this framework, except for the framework author(s).

On the other hand, one could argue that any improvement to the generic framework would benefit all the backends at once (but that's just a property of a generic framework: errors/bugs would be shared too!). Still, it doesn't seem to fit the initial goal of being welcoming and simple for newcomers.

Member Author:

Ah, so a key thing here is that this extern type decl is referring to an enum that's already defined in Rust code elsewhere. Perhaps we could find a better way of doing it later, but for now the idea is to declare the view of the types that we need to generate Rust code against them, but not to generate the type definitions themselves. So during the migration, we still have the enum Inst in isa/*/inst/mod.rs, and we can still e.g. write tests against its size and play with its packing.

One could argue that it'd be better to generate both from a single source of truth, and I think that has to be the eventual answer; in that case one could still write a unit-test against std::mem::size_of::<x64::Inst>() or whatnot but changing it would mean editing the DSL.
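
Such a test is a one-liner today and would remain possible alongside generated definitions; a sketch, using the `x64::Inst` path from the sentence above (the expected size is an illustrative placeholder, not the real figure):

```rust
#[test]
fn x64_inst_size_does_not_grow() {
    // Guard against accidental growth of the instruction enum; 32 bytes
    // here is a made-up expected value for illustration.
    assert_eq!(std::mem::size_of::<x64::Inst>(), 32);
}
```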

A key factor in approachability IMHO (this may be controversial; please discuss!) could be to include the generated code in the git repo. Aside from implications on build-time above, it would mean that (i) the actual Rust definitions, at the stage in our evolution when they are all generated from the DSL, should be easy to find; and (ii) we can add a note to the top of the file saying "autogenerated, please see file.isle" and maybe on the actual enum a comment like "autogenerated from definition Foo in file.isle line 123".

I do think that there is an interesting philosophical debate about the definition of "welcoming and simple": building some DSL framework means that if we're careful, the "happy path" of adding a new instruction or a new lowering rule is much easier, and also means that there is less cognitive load / fewer subtle invariants to keep in mind. That's pretty much the driving factor I had in mind, at least. But I agree that it can make systemic optimizations, like "make the Inst enum smaller", more difficult if it means one has to step up into a meta-level, and it also puts a lot of pressure on the framework to be correct, lest we have difficult-to-find bugs. IMHO the tradeoff is worth it and does make the codebase more welcoming for the most common sort of contribution we anticipate in the future, but... please do discuss!

@cfallin (Member, Author) commented Sep 10, 2021

Greetings all! I've been churning on a little prototype of the DSL described in this RFC -- or at least something very close to it, with tweaks as needed as I discovered interesting issues in implementation -- in order to get a feel for its semantics.

For your consideration, a small sketch of an instruction selector in this commit. It doesn't actually do anything yet; it's not wired up to the rest of Cranelift; but it does successfully pass through this prototype ISLE compiler and generate some Rust code.

The little sketch in the commit above has some "prelude" that gives definitions to glue the DSL onto Cranelift, but the interesting bit is in x64/lower.isle -- some sample rules:

```plain
(rule
 (Lower (Iadd a (InputConst k)))
 (X86Add (InputToReg a)
         (RegMemImm.Imm k)))
(rule
 (Lower (Iadd (InputConst k) b))
 (X86Add (InputToReg b)
         (RegMemImm.Imm k)))
(rule
 (Lower (Iadd a b))
 (X86Add (InputToReg a)
         (RegMemImm.Reg (InputToReg b))))

(rule
 (Lower (Iadd a (InputSourceInst (Load addr off))))
 (X86Add
  (InputToReg a)
  (RegMemImm.Mem (Mem.BaseImm (InputToReg addr) off))))
(rule
 (Lower (Iadd (InputSourceInst (Load addr off)) a))
 (X86Add
  (InputToReg a)
  (RegMemImm.Mem (Mem.BaseImm (InputToReg addr) off))))
```

This all can be bikeshedded to fine-tune the actual types and definitions as much as we want; the point is that we have a term-rewriting system, and descriptive terms in the "extractors" (LHS terms) and "constructors" (RHS terms).

To head off any quick reactions to the verbose terms like `InputSourceInst` and questions such as "why not just `(Iadd (Load ...) ...)`?", I'd point back to the section of the RFC that motivates the "programmable extractors" with examples where we might want more than one such "virtual term" to be possible in a given context.

To the best of my thinking right now, all of these rules are compatible with our main goals -- both quick (single-pass) lowering as compiled by the prototype compiler above, and a value-equivalence-based view of the lowering to aid verification and other formal efforts. (To see how, think of e.g. InputSourceInst as just the identity function -- it's a type-adapter that makes the lowering work but doesn't change the "value".)

Let me know what you think! I am going to context-switch away from this for a little bit but (i) I will plan to braindump a good amount about the internals / implementation details / what I've learned while prototyping the compiler-compiler, and (ii) I do intend the above prototype to be a "real" starting point (modulo some cleanups) that we can develop into the actual implementation, if this RFC is accepted.

@fitzgen (Member) commented Sep 13, 2021

```plain
(rule
 (Lower (Iadd a (InputSourceInst (Load addr off))))
 (X86Add
  (InputToReg a)
  (RegMemImm.Mem (Mem.BaseImm (InputToReg addr) off))))
(rule
 (Lower (Iadd (InputSourceInst (Load addr off)) a))
 (X86Add
  (InputToReg a)
  (RegMemImm.Mem (Mem.BaseImm (InputToReg addr) off))))
```

Is the idea that the InputSourceInst extractor will fail if the load isn't the same color as this operation and so we don't need to expose the coloring to the DSL, or is the lack of coloring here an oversight/simplification?

Either way, I think that either building the coloring logic into the InputSourceInst extractor or exposing a new extractor specifically for checking color could work here.

@cfallin (Member, Author) commented Sep 13, 2021

> Is the idea that the InputSourceInst extractor will fail if the load isn't the same color as this operation and so we don't need to expose the coloring to the DSL

Yes, exactly, it's a wrapper (or will be once we write the trait impl) around the LowerCtx::get_input_as_source_or_const method to get the producer for an input, which implements the logic (involving matching color and/or purity) that allows an instruction to be merged in. So it will always fail when we aren't able to merge that far up the tree; all we need to reason about in the DSL is the shape of the tree that we want to match.

@bnjbvr (Member) commented Sep 17, 2021

While I think a better way to handle instruction selection and generation is necessary, I have a few concerns about this proposal, especially around this not reaching the ideal goal of simplicity we're looking for, and I am unsure about the use of a custom DSL for this task.

Moving the debate out of the comment thread: I think this doesn't get us to a better place in terms of simplicity. Simplicity/ease-of-use is largely subjective anyway, and creator bias/hindsight bias make it likely that a system appears simple to its creator's own eyes. However, even after re-reading the proposal a few times, I'm still hesitant about a few concepts that seem key to knowing how to do things (extractors/entrypoints/...). Introducing concepts is inevitable, of course, but if these are hard to grasp for a majority of current and future contributors, the system wouldn't be very welcoming for users, becoming a large barrier to entry. I gather from other people asking for detailed examples and supplementary blog posts explaining this new ISLE framework that I might not be the only one having uncertainties here; I don't want to read too much into this, though, because I think I have been the only one expressing explicit concerns of this kind so far. In terms of complexity, I see it like this: adding a single new instruction? Probably fine; I can just copy-paste some other ISLE code, stealing bits here and there. Adding a way to find a new kind of pattern? Unclear if I could do it. Adding a whole new target ISA? Sounds scary. All these things feel straightforward to me in the existing code base, but I am myself subject to creator bias, of course :-) and navigating the code base is pretty easy thanks to the great Rust developer tools.

And that relates to the second point: using a custom DSL for this kind of work would be a clear step backwards, in my opinion. We're talking about a small programming language here, whose implementation will be tweaked, and whose "programs" (ISLE definitions) will need to be debugged, optimized, etc. Moving the old meta framework from Python to Rust was a clear win, because we got static typing, which removed a whole range of error-proneness, and a better developer environment in general: being able to share code between codegen and meta, using Rust packages in the meta crate if need be, using all the developer tools included in IDEs (rust-analyzer ❤️). Going back to a custom DSL, we'd lose debuggers and IDE functionality, namely jump-to-definition, i.e. the ability to look at how things are implemented ("what's an extractor? oh, I could just jump to its definition -- owait, no, it's done in a different language at a different level; might as well just grep around in the code base, hoping that I'll find where/how it's done"). This is not intractable, of course, and all these tools could be reimplemented, but that might be a waste of resources. Any of Rust/Python/TypeScript would be a better fit at this point, since they all now have good development ecosystems (with a clear preference for Rust, of course!).

Please bear with me here: I am fully in favor of a change in the way we're doing instruction selection, and I understand all the benefits, but I am not convinced the proposal in its current state would solve our complexity issues.

@cfallin (Member, Author) commented Sep 17, 2021

@bnjbvr thank you for taking the time to detail your concerns here!

First, at a high level: I'm glad that we agree that some sort of DSL or systematized approach is necessary. The recent string of SIMD-related fuzzbugs, and other difficulties we've had in the past in communicating some of the subtle requirements and invariants in the current handwritten backends (e.g., regalloc metadata, when it is safe to use registers, when it is safe to sink an instruction, and some of the subtle bugs that have resulted when we get this wrong), have convinced me pretty solidly that we need to automate; a handwritten approach, while tempting below a certain level of intrinsic complexity (number of lowering cases / instructions), is just too error-prone.

There are also the things that we just can't do today, that a systematized DSL would enable: large-scale refactors that alter how the lowering works (e.g., switching to a new regalloc API, or a different emission strategy), combining multiple layers of rewrite rules or manipulating/processing them in some other way (e.g. Peepmatic-style generated simplification patterns), and consuming the lowering patterns for other uses, such as verification. So, in my mind, we need a DSL; the question is just whether this one is the right one.

If I'm reading your main concerns right, I think they distill down to this -- please correct me if I mis-state anything:

  • Discoverability: code written in Rust benefits from all sorts of tooling, and a "what you see is what you get" simplicity. Curious how the compiler handles case X? Just trace through the code, or step in a debugger, or "go to definition". I agree -- this is great. A DSL adds a level of indirection that makes this harder.
  • Complexity of mental model: again in handwritten Rust, we have a "match on opcode and emit()" paradigm. ISLE introduces new concepts, in particular around term-rewriting, extractors and constructors, and the like. This is a barrier to entry.

Both of these are real issues and I think that weighing these against the benefits is largely a subjective question. But I want to dive in a bit on each.

Complexity of mental model: it might be useful to start from a pure term-rewriting system as a neutral, fairly mainstream sort of approach (see e.g. the Go compiler's DSL, or tablegen's expression trees). In your view, is this sort of approach reasonably approachable, or does it suffer from some of the same issues? In other words, I'm trying to distill whether some of the new concepts in ISLE specifically (extractors, single-pass lowering) are the issue, or instead if the issue is just thinking in terms of layered rewrite rules.

If the former, I think we can definitely try to refine the terminology, or give better examples. If the idea of a programmable extractor in particular is ultimately too weird or objectionable, I can imagine some alternative ways of getting the same thing -- maybe we have constructors on the right-hand side instead that can fail to match. If the latter, and all term-rewriting systems raise concerns, then I think we're at more of a philosophical impasse, since that way of representing instruction lowering is so powerful for all the other things we want to do, but I'm still happy to discuss more :-)

Discoverability: this one is a bit harder, and I agree that pure Rust code is the gold standard w.r.t. discoverable definitions and autocompletions. I do want to note re: your listed advantages above that ISLE, too, is strongly statically typed, unlike a lot of other past term-rewriting systems; so it will at least catch type errors, if not proactively help avoid them in an IDE.

I'm curious, though, if you have better ideas to address it -- an EDSL in Rust somehow? If so, I'm not sure how we'd be able to do the transforms we want to do -- if you squint enough, the ISLE compiler is sort of doing a "transpose", in that it's weaving together a bunch of different lowering-pattern rules by their common prefixes. It's thus a kind of global transform, and would probably need a Rust compiler plugin, rather than just e.g. Rust macros that build data structures, as the cranelift-codegen-meta DSL does. I'm very open to suggestions here.

Anyway, that's my braindump for now; hopefully we can come to a better consensus here, or at least distill out the subjective decision points that underlie the space!

@fitzgen (Member) commented Sep 17, 2021

Hi @bnjbvr :)

First off, I want to recognize that your concerns are valid, particularly around IDEs and developer tooling. I think that is just a fundamental trade off of using a DSL vs open coding, and something we need to explicitly acknowledge and accept if we are going to go down the DSL route.

Aside: embedded DSLs -- EDSLs, as @cfallin mentioned, i.e. DSLs embedded within the "main" language (apologies if you are already familiar with this terminology) -- can sometimes have their cake and eat it too with regards to tooling vs DSL. In Rust, I think that would probably look like some kind of macro sugar language, because I don't think Rust's regular abstraction capabilities would allow us to express the kinds of transformations and combinations of rules that we want to express. However, (a) my experience has been that rust-analyzer isn't great at seeing through macros, and (b) if we use procedural macros (rather than macro-rules macros, which again probably can't do the kinds of transforms/combos we want) then we arguably aren't really doing an EDSL anymore, since we are just compiling arbitrary code to Rust in arbitrary ways. So I don't think we can have our cake and eat it too in this particular case.

All that said, as someone who hasn't contributed to the new backends, I personally find ISLE more inviting than the open-coded backends. Many grains of salt here, since I did some research into Go's/GCC's/etc DSLs for instruction selection and peephole optimizations when designing and implementing Peepmatic, so I've already made myself familiar with this term-rewriting paradigm.

It was also helpful to step through a few very small examples with @cfallin the other day. We started with just the rule to lower an arbitrary iadd, putting both operands in registers. Then we inspected the generated Rust code. Next, we added rules for if the first operand is an iconst that fits in an immediate and lowered that into an add-with-immediate, and did the same for the second operand. Then we inspected the generated Rust code for these three, partially overlapping rules. This gave me a lot of confidence in the ISLE compiler, and that it is basically doing what I expect it should. I would recommend this exercise to others as well, and I'm happy to pair with people to go through it, if anyone would like.

@cfallin (Member, Author) commented Sep 17, 2021

As a heads-up, I just added a Cranelift biweekly meeting agenda item (edit: for this coming Monday's meeting) to discuss this more -- @bnjbvr, @fitzgen and any others, perhaps we can talk more through these concerns and thoughts then!

@bnjbvr (Member) commented Sep 20, 2021

Thanks. As a note, I won't be able to attend this particular meeting because of a conflict.

@cfallin (Member, Author) commented Sep 24, 2021

Just coordinated briefly with @bnjbvr and have now added an agenda item for the next (Mon Oct 4) Cranelift meeting -- anyone interested in continuing the discussion please feel free to join!

@fitzgen (Member) commented Oct 4, 2021

FYI, I have added a tutorial and implementation overview to ISLE's README.md.

@cfallin (Member, Author) commented Nov 1, 2021

Hi all -- thanks for the continuing input and patience on this long-running effort!

@fitzgen gave an excellent update in the Cranelift biweekly today on his progress driving forward the prototype Cranelift integration for ISLE, along with example implementations of several instruction lowerings. The general consensus from the meeting was that there are no strong objections to moving forward with this. So, with that in mind, I would like to make a...

Motion to finalize with a disposition to merge

Stakeholders sign-off

Arm

DFINITY

Embark Studios

Fastly

Google/Envoy

Intel

Microsoft

Mozilla

IBM

wasmCloud

Unaffiliated

@fitzgen (Member) commented Nov 1, 2021

Fwiw, here are the slides for the update I gave at today's cranelift meeting: https://docs.google.com/presentation/d/1b6psjaIKkVTZGaGzMcw9h_Ahgih-ctUcnz0Dbjn11GI/edit

@cfallin (Member, Author) commented Nov 1, 2021

(sorry @alexcrichton, just noticed your name was missing -- I had copied the checkbox list from your latest RFC's motion-to-finalize comment. added now!)

@bjorn3 (Contributor) left a comment

no objections

@cfallin (Member, Author) commented Nov 3, 2021

Now that we have signoffs from two separate groups, as per

> Once any stakeholder from a different group has signed off, the RFC will move into a 10 calendar day final comment period (FCP), long enough to ensure that other stakeholders have at least a full business week to respond.

we are moving into the Final Comment Period. The FCP will end on Monday, November 15.

@uweigand (Member):

This is OK with me.

@cfallin (Member, Author) commented Nov 15, 2021

Hi all -- now that the Final Comment Period has elapsed and there are no pending objections, it is time to merge this RFC!

The next step will be review and landing of the initial code itself. I plan to put together some language-reference-like documentation for the DSL itself, and I'm happy to help answer questions and onboard folks in general as we transition the codegen backends to this new approach!

@cfallin cfallin merged commit 614b870 into bytecodealliance:main Nov 15, 2021
@cfallin cfallin deleted the cranelift-isel-isle-peepmatic branch November 15, 2021 16:11