On the design of stable AST IDs #1422
Replies: 3 comments 6 replies
-
I'll repeat what I said before: Swift must have answers to all of these questions, and we should at least know what they are before we try to come up with something better. |
Beta Was this translation helpful? Give feedback.
-
It isn't obvious to me that it's solved if there are ways to include multiple modules built with the same original name (e.g.
Once name resolution is complete (i.e. in the IR) can't IDs be independent of the parser? Or do IDs in the IR apply to things with no name?
I think it should be easy enough to create IDs that are resilient to reordering sources FWIW if we wanted to parse in parallel. This is a point where we ought to be considering the needs of LSP support and more stability may be important.
Do we really even have modules yet? Last I checked it was possible to load an AST from disk, but that was always a monolith. |
Beta Was this translation helpful? Give feedback.
-
Well I guess I can't really argue with that because “not far” is subjective, but it seems to me there's a lot of groundwork to lay first and a substantial rework of some of our data structures.
Maybe optional for now, but it must be designed-for up-front.
Just reinventing keypaths in a compressed form.
Not on every compilation if you're ever going to uphold no. 4 above. IMO we should take the same approach Swift evidently does: depend on unique module names at the base level, hash the module name to get a module identifier, and implement the module renaming strategy described in the Swift proposal, which surely remaps IDs at load time.
? Then we have no resiliency whatsoever.
Yes, thanks to overloading. But it is possible to evaluate an API change and know whether that's possible. It's also possible to build a tool that checks to see if any calls would be changed by recompilation.
Personally, I think we should start with the scenarii that Swift covers and ask ourselves if anything is missing. |
Beta Was this translation helpful? Give feedback.
-
I'm opening this thread to keep a record of my thoughts about possible approaches to make AST IDs stable in
hc
's implementation (and perhaps nerd-snipe someone to pick up where I left).The short description of the problem is that we must be able to load a compiled module from disk and use it as a dependency of some other module. The canonical example is the standard library. We want to compile the standard library once and then re-use it as a dependency of any program we write. To do that, we need a "stable" way to refer to entities across compilations. In other words, we need identifiers that do not depend on the order in which declarations are parsed or the order in which module dependencies are loaded.
We have to be a little more precise about what we mean by "compiling a module", though, because inlining and monomorphization is part of the language. Going back to our canonical example, for instance, it is clear that in most cases we want user code to inline many things from the standard library (e.g.,
Int.infix+
). So before we continue, let's look at what we can expect from a compilation.There are at least 3 artifacts we can produce after compiling a module:
The binary file is the "final" artifact of the compilation and I suppose that is the product we want to be able to reuse without re-compiling anything. That's why we want an ABI. The two other artifacts are intermediate results but I think there are quite valuable because type checking and IR lowering are quite expensive operations.
ID stability is an open issue for IR and AST artifacts. The problem is more or less solved at the binary level because we're dealing with mangled symbols, which are already stable. (e.g.,
Hylo.Int
is always spelledaTR2Z
). In an AST or in the IR, however, IDs depend on the parser.In the current implementation, files are processed in some deterministic order (our parser isn't parallel) to iteratively construct an AST. Each node of the AST gets an idea which is simply its position in an array. The ID of a particular node remains unchanged between two compilations iff nothing in the source syntax changes. It is possible to load a module from disk but only if all dependencies are loaded in the exact same order (and source syntax didn't change).
AFAICT, the last bit is the problem we have to solve. Given modules
A
,B
,C
such thatA
depends onB
andC
, it should be possible to loadB
orC
in any order and it should be possible to separately compileB
andC
before using them inA
.One approach is to keep parsing files in a deterministic order but assign IDs starting from 0 for each distinct module, adding a unique identifier for the module as a prefix. In other words, a node ID would be a pair
(m, n)
wherem
is the unique identifier of its module andn
is some 0-based index in the collection of nodes representing the whole module.The next challenge is to come up with unique module identifiers. We could use a String but I worry that it will be fairly inefficient. Node IDs are used pretty much everywhere and so adding a string to their representation would blow their memory footprint and negatively equality checks. Instead, I think we can just pick some random integer. That obviously opens the door to collisions, but it should be easy to detect those. Then the simplest cure would be to re-compile.
Beta Was this translation helpful? Give feedback.
All reactions