On the design of stable AST IDs #1422

kyouko-taiga · 2024-04-01T18:57:31Z

kyouko-taiga
Apr 1, 2024
Maintainer

I'm opening this thread to keep a record of my thoughts about possible approaches to make AST IDs stable in hc's implementation (and perhaps nerd-snipe someone to pick up where I left off).

The short description of the problem is that we must be able to load a compiled module from disk and use it as a dependency of some other module. The canonical example is the standard library. We want to compile the standard library once and then re-use it as a dependency of any program we write. To do that, we need a "stable" way to refer to entities across compilations. In other words, we need identifiers that do not depend on the order in which declarations are parsed or the order in which module dependencies are loaded.

We have to be a little more precise about what we mean by "compiling a module", though, because inlining and monomorphization are part of the language. Going back to our canonical example, it is clear that in most cases we want user code to inline many things from the standard library (e.g., Int.infix+). So before we continue, let's look at what we can expect from a compilation.

There are at least 3 artifacts we can produce after compiling a module:

A binary file (executable or library)
A "refined" intermediate representation (i.e., IR that went through all mandatory passes)
A typed AST

The binary file is the "final" artifact of the compilation and I suppose that is the product we want to be able to reuse without re-compiling anything. That's why we want an ABI. The two other artifacts are intermediate results but I think they are quite valuable because type checking and IR lowering are quite expensive operations.

ID stability is an open issue for IR and AST artifacts. The problem is more or less solved at the binary level because we're dealing with mangled symbols, which are already stable (e.g., Hylo.Int is always spelled aTR2Z). In an AST or in the IR, however, IDs depend on the parser.

In the current implementation, files are processed in some deterministic order (our parser isn't parallel) to iteratively construct an AST. Each node of the AST gets an ID, which is simply its position in an array. The ID of a particular node remains unchanged between two compilations iff nothing in the source syntax changes. It is possible to load a module from disk but only if all dependencies are loaded in the exact same order (and the source syntax didn't change).

AFAICT, the last bit is the problem we have to solve. Given modules A, B, C such that A depends on B and C, it should be possible to load B or C in any order and it should be possible to separately compile B and C before using them in A.

One approach is to keep parsing files in a deterministic order but assign IDs starting from 0 for each distinct module, adding a unique identifier for the module as a prefix. In other words, a node ID would be a pair (m, n) where m is the unique identifier of its module and n is some 0-based index in the collection of nodes representing the whole module.
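A minimal sketch of the (m, n) pairing described above, written in Python for brevity rather than in the compiler's own language. The names `NodeID` and `ModuleAST`, and the module numbers 7 and 9, are made up for illustration and are not hc's actual types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeID:
    module: int  # m: the unique identifier of the module
    index: int   # n: 0-based position in that module's node array


@dataclass
class ModuleAST:
    module: int
    nodes: list = field(default_factory=list)

    def insert(self, node) -> NodeID:
        """Appends `node` and returns its ID, numbered from 0 within this module."""
        self.nodes.append(node)
        return NodeID(self.module, len(self.nodes) - 1)


# Two modules number their nodes independently of each other.
a = ModuleAST(module=7)
b = ModuleAST(module=9)
first_in_a = a.insert("fun f")
first_in_b = b.insert("type T")
second_in_a = a.insert("fun g")
```

Both modules hand out index 0, yet the resulting IDs differ because the module component differs, so the order in which other modules were parsed or loaded no longer matters.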

The next challenge is to come up with unique module identifiers. We could use a String but I worry that it will be fairly inefficient. Node IDs are used pretty much everywhere and so adding a string to their representation would blow up their memory footprint and slow down equality checks. Instead, I think we can just pick some random integer. That obviously opens the door to collisions, but it should be easy to detect those. Then the simplest cure would be to re-compile.
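As a rough illustration of "pick a random integer and detect collisions", here is a Python sketch. The 20-bit width, the `Program` class and the `ModuleIDCollision` exception are all assumptions introduced for the example.

```python
import random

MODULE_ID_BITS = 20  # hypothetical width of the module identifier


class ModuleIDCollision(Exception):
    """Raised when two distinct modules carry the same identifier."""


def fresh_module_id(rng: random.Random) -> int:
    """Draws a random identifier in the range 0 ..< 2**MODULE_ID_BITS."""
    return rng.getrandbits(MODULE_ID_BITS)


class Program:
    def __init__(self):
        self.loaded = {}  # module identifier -> module name

    def load(self, name: str, module_id: int) -> None:
        """Registers a module, refusing an identifier already taken by another module."""
        other = self.loaded.get(module_id)
        if other is not None and other != name:
            raise ModuleIDCollision(f"{name} and {other} share ID {module_id}")
        self.loaded[module_id] = name
```

Detection is a single dictionary lookup at load time; the caller can react to `ModuleIDCollision` by recompiling one of the two modules with a fresh identifier.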

dabrahams · 2024-04-01T21:42:55Z

dabrahams
Apr 1, 2024
Maintainer

I'll repeat what I said before: Swift must have answers to all of these questions, and we should at least know what they are before we try to come up with something better.

3 replies

kyouko-taiga Apr 1, 2024
Maintainer Author

I have a rough understanding of the picture Swift proposes for the ABI but it is far less clear to me what happens at higher levels and I suspect it's far more specific to swiftc's internals.

dabrahams Apr 1, 2024
Maintainer

I don't think your suspicions should stand in the way of an investigation. What if they're not borne out?

kyouko-taiga Apr 1, 2024
Maintainer Author

I'm not saying they should stand in the way ;)

dabrahams · 2024-04-01T22:06:43Z

dabrahams
Apr 1, 2024
Maintainer

The problem is more or less solved at the binary level because we're dealing with mangled symbols, which are already stable (e.g., Hylo.Int is always spelled aTR2Z).

It isn't obvious to me that it's solved if there are ways to include multiple modules built with the same original name (e.g. Utils) in a program.

In an AST or in the IR, however, IDs depend on the parser

Once name resolution is complete (i.e. in the IR) can't IDs be independent of the parser? Or do IDs in the IR apply to things with no name?

files are processed in some deterministic order (our parser isn't parallel) to iteratively construct an AST. Each node of the AST gets an ID, which is simply its position in an array. The ID of a particular node remains unchanged between two compilations iff nothing in the source syntax changes.

I think it should be easy enough to create IDs that are resilient to reordering sources FWIW if we wanted to parse in parallel. This is a point where we ought to be considering the needs of LSP support and more stability may be important.

it is possible to load a module from disk but only if all dependencies are loaded in the exact same order (and source syntax didn't change).

Do we really even have modules yet? Last I checked it was possible to load an AST from disk, but that was always a monolith.

3 replies

kyouko-taiga Apr 1, 2024
Maintainer Author

It isn't obvious to me that it's solved

True. There would be a conflict here.
We can mangle some resource identifier in the prefix of public symbols (e.g., org.hylo-lang.utils) and reasonably hope that such identifiers are unique enough. Of course that isn't foolproof. I did say that the problem was more or less solved ;)

Or do IDs in the IR apply to things with no name?

Yes, IDs are used to identify lambdas, just off the top of my head.

We can come up with different IDs for anonymous constructs but that won't help stability. FWIW, I checked in Swift and (as expected), the ID of a lambda changes if sources change, both at the ABI and IR level.

I think it should be easy enough to create IDs that are resilient to reordering sources

You make a good point about LSP support. We should think about incremental compilation too. Note that resilience in this context would also have to consider adding/removing files.

Do we really even have modules yet?

It's true we don't have modules but we're really almost there. The only significant obstacle is to be able to serialize just a portion of a typed program, which I think isn't too hard if we redefine property maps to keep track of modules. Specifically, we should go from NodeID -> T to ModuleID -> NodeID -> T.
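To make the change from NodeID -> T to ModuleID -> NodeID -> T concrete, here is a small Python sketch. `PropertyMap` and its `slice` method are hypothetical names, and node IDs are modeled as plain (module, index) tuples.

```python
from collections import defaultdict


class PropertyMap:
    """Maps a module identifier to a table from node index to value."""

    def __init__(self):
        self.storage = defaultdict(dict)

    def __setitem__(self, node_id, value):
        module, index = node_id
        self.storage[module][index] = value

    def __getitem__(self, node_id):
        module, index = node_id
        return self.storage[module][index]

    def slice(self, module):
        """Returns a copy of the entries that relate to nodes of `module` only."""
        return dict(self.storage.get(module, {}))


types = PropertyMap()
types[(0, 3)] = "Int"
types[(1, 3)] = "Bool"
only_module_0 = types.slice(0)
```

With the outer key in place, serializing "just a portion of a typed program" amounts to writing out one slice per property map instead of filtering a flat table.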

dabrahams Apr 1, 2024
Maintainer

The only significant obstacle is to be able to serialize just a portion of a typed program

Eh, but doesn't that beg the question of ID representation? And anyway, isn't it /de/-serialization that's the big problem, e.g. how do you reassemble an AST when one or more modules change?

kyouko-taiga Apr 2, 2024
Maintainer Author

Eh, but doesn't that beg the question of ID representation?

What we can almost do is have a compiled artifact that contains all the dependencies of some program, load it, and then keep on compiling additional stuff. That is almost what we do currently with the standard library, which is a dependency of everything, except that we only load the syntax and not the property maps nor the IR. The point is that given a sequence of dependencies, we're not far from being able to splice a module from the sequence and reload it later.

Getting to this milestone would already mean that we'd no longer have to re-check the standard library in every test, which I think would slash the time it takes for our test suite to go through by at least 2/3. So even without any change in ID representation, this limited support for modules would still be quite useful.

From there, the issue is that we'll be unable to use modules unless we always use them in a compatible sequence of the dependencies. That won't fly if we want to share pre-compiled artifacts across different projects. It's convenient that the standard library is loaded first, so it's always the first element of the sequence, but we'll quickly run out of luck for other modules.

Let's illustrate:

We have a library C that depends on libraries A and B, which are independent. We want to:

1. Separately compile A and B and store them for reuse
2. Load A and B from disk
3. Compile C
4. (optional) Modify A without recompiling everything

Steps (1) and (2) are not possible without "stable" IDs. Even if we fix a particular order, we would be unable to re-use A or B in any other setup not compatible with this order (e.g., E depends on A and D, the latter of which was compiled as the unique dependency of another project).

For (1), it's almost certain that the solution is to tell AST nodes apart by their module. So let's assume we can define IDs in a separate "address space" for each module. We can imagine defining an "address space" for IDs in a relatively simple way. For example we can take 20 bits from the 64 we currently use and say that they represent the module identifier. We can pick this identifier at random.
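The bit layout can be sketched as follows in Python. Putting the module identifier in the high 20 bits is an arbitrary choice made for the example; the `pack` and `unpack` names are hypothetical.

```python
MODULE_BITS = 20                   # bits reserved for the module identifier
INDEX_BITS = 64 - MODULE_BITS      # bits left for the node index
INDEX_MASK = (1 << INDEX_BITS) - 1


def pack(module: int, index: int) -> int:
    """Combines a module identifier and a node index into one 64-bit value."""
    assert 0 <= module < (1 << MODULE_BITS)
    assert 0 <= index <= INDEX_MASK
    return (module << INDEX_BITS) | index


def unpack(raw: int):
    """Splits a packed ID back into (module, index)."""
    return raw >> INDEX_BITS, raw & INDEX_MASK
```

The ID stays a single machine word, so equality checks and hashing cost the same as they do today; the price is that each module is limited to 2^44 nodes.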

A typical application will not depend on thousands and thousands of modules. To get a sense of the ballpark I looked at the average dependency graph of Node.js applications, which in my experience tend to have ridiculously many dependencies. We're talking about ~600 packages. With those numbers, the probability that any two given modules collide is vanishingly small, and we can always trigger recompilation of the conflicting artifacts if a collision does occur anyway.
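To put a number on the overall odds, here is a back-of-the-envelope estimate in Python, assuming module identifiers are drawn uniformly and independently. The function name and the 32-bit comparison are mine, for illustration. It computes the probability that at least two modules of one program end up with the same identifier (the usual birthday problem).

```python
import math


def collision_probability(modules: int, bits: int) -> float:
    """Probability that at least two of `modules` random `bits`-wide IDs coincide."""
    space = 2 ** bits
    # log of the probability that all identifiers are distinct
    log_no_collision = sum(math.log1p(-i / space) for i in range(modules))
    return -math.expm1(log_no_collision)  # 1 - exp(log_no_collision)


p20 = collision_probability(600, 20)  # 20-bit module field
p32 = collision_probability(600, 32)  # a wider field, for comparison
```

Under these assumptions, 600 modules in a 20-bit space produce at least one collision roughly 16% of the time, so the recompilation fallback would not be that rare. With 32 bits the estimate drops well below 0.01%, which suggests that the width of the module field deserves some thought.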

So in my example, we can separately compile A and B using this technique:

We compile A and produce IDs (m₀, 0), (m₀, 1), ...
We produce artifacts extracting only the properties that relate to nodes in m₀
We compile B and produce IDs (m₁, 0), (m₁, 1), ...
We produce artifacts extracting only the properties that relate to nodes in m₁

Now we have two "compiled" modules and we can load them from disk in any order to extend a typed program. That solves (2).
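A toy Python illustration of why load order stops mattering once every artifact carries its own module identifier. The artifact layout (a dictionary with `module` and `types` keys) and the identifiers 0xA and 0xB are invented for the example.

```python
def load_all(artifacts):
    """Merges per-module property tables into one program-wide table."""
    program = {}
    for artifact in artifacts:
        m = artifact["module"]
        for index, value in artifact["types"].items():
            program[(m, index)] = value
    return program


# Each artifact only contains properties of nodes in its own address space.
artifact_a = {"module": 0xA, "types": {0: "Int", 1: "Bool"}}
artifact_b = {"module": 0xB, "types": {0: "String"}}

a_then_b = load_all([artifact_a, artifact_b])
b_then_a = load_all([artifact_b, artifact_a])
```

Both orders yield the same table because the keys of A and B can never overlap, which is the property that position-in-one-big-array IDs lack.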

Things get thorny for (3). The problem is that the approach I described so far only works if we can guarantee that all compiled artifacts agree on the address space of their shared dependencies. Otherwise the type checker would get confused (e.g., Hylo.Int is referred to as (m₂, 123) in the API of module A and (m₃, 123) in the API of module B).

We can maintain this guarantee if we're compiling everything on the same machine. We can keep a local cache of compiled modules and assemble dependencies to compile new ones. In my example, when we first compile the standard library we pick mₖ and then both A and B will inherit that choice. So we can solve (3) as long as we don't consider things downloaded from the Ether.

That brings us to the first hurdle: How can we guarantee address space consistency across shared dependencies?

This problem will be common if we pick IDs at random to distribute compiled artifacts. At the very least, all modules will disagree on the address space for the standard library. We can mitigate the issue by reserving specific IDs for known libraries. We can even imagine a repository where "published" artifacts get a reserved ID, which is essentially adapting the local solution that's sketched above. I'll wave my hands around issues related to maintaining this repository and let people compile stuff without Internet access for the time being.

If nothing else, we can trigger recompilation when we detect that dependencies do not agree on their shared dependencies' address spaces. Above, for example, if the compilation of A picked m₂ for the standard library but that of B picked m₃, we ask for either one to be recompiled so that they agree.
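Detecting that kind of disagreement could look like the following Python sketch, assuming each artifact records which identifier it used for each of its dependencies. The `dependencies` table and the function name are hypothetical.

```python
def address_space_conflicts(artifacts):
    """Returns {dependency name: identifiers in use} for dependencies the artifacts disagree on."""
    seen = {}
    for artifact in artifacts:
        for dependency, module_id in artifact["dependencies"].items():
            seen.setdefault(dependency, set()).add(module_id)
    return {name: ids for name, ids in seen.items() if len(ids) > 1}


# A was compiled against a standard library numbered 2, B against one numbered 3.
a = {"name": "A", "dependencies": {"Hylo": 2}}
b = {"name": "B", "dependencies": {"Hylo": 3}}
conflicts = address_space_conflicts([a, b])
```

A non-empty result is the signal to recompile one of the offending artifacts, or to remap its identifiers if we ever support that.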

The next hurdle relates to (4): How can we avoid recompiling everything when sources change?

My approach requires recompilation of a module and all its dependents when its sources change. But I think that is somewhat expected because of inlining and monomorphization. The ABI would not change, though. I guess the question here is about guarantees on API stability and what those mean for ID resilience.

I have to learn more from Swift in that area but my intuition is that we can't avoid recompilation if sources change and I'm not even sure we should. AFAIU, the thing we want to guarantee is that if we have a compiled binary, we can update it without recompiling anything. This idea does not take API changes into account. If we add a new function in the API without breaking ABI, it is still possible that recompiling the whole thing would actually produce different behavior. So the only case where we'd want to support a change in sources without requesting recompilation is when the change would not break API guarantees, such as the result of type inference for a particular expression.

So maybe we can reformulate the hurdle as such: How can we generate IDs that are resilient to changes not impacting API?

One idea would be to first parse the public API of a module and then parse everything else. Then the AST for the module would be a sequence (m₀, 0), ..., (m₀, a), ..., (m₀, b) where nodes in the range 0 ..< a are part of the API and nodes in the range a ..< b are free to change without causing recompilation.
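A simplified Python sketch of the API-first numbering, treating each declaration as a single node and ignoring nesting. `number_nodes` and the declaration names are hypothetical.

```python
def number_nodes(declarations):
    """Numbers public declarations first, then the others.

    `declarations` is a list of (name, is_public) pairs in source order.
    Returns (name -> index, a) where indices 0 ..< a belong to the API.
    """
    public = [name for name, is_public in declarations if is_public]
    private = [name for name, is_public in declarations if not is_public]
    ids = {name: i for i, name in enumerate(public + private)}
    return ids, len(public)


before = [("f", True), ("helper", False), ("g", True)]
# Same module after adding a private declaration in the middle of the file.
after = [("f", True), ("helper", False), ("helper2", False), ("g", True)]

ids_before, a_before = number_nodes(before)
ids_after, a_after = number_nodes(after)
```

The indices of `f` and `g` are the same in both versions, so clients that only refer to the API range would not notice the new private declaration. Adding or reordering public declarations would still shift indices in this naive version.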

That's as far as I went, not having studied what Swift does. But before I do, I think we should establish all the scenarii of interest and the guarantees that we want to offer in each of them.

dabrahams · 2024-04-02T23:39:18Z

dabrahams
Apr 2, 2024
Maintainer

The point is that given a sequence of dependencies, we're not far from being able to splice a module from the sequence and reload it later.

Well I guess I can't really argue with that because “not far” is subjective, but it seems to me there's a lot of groundwork to lay first and a substantial rework of some of our data structures.

(optional) Modify A without recompiling everything

Maybe optional for now, but it must be designed-for up-front.

separate "address space" for each module.

Just reinventing keypaths in a compressed form.

We can pick this identifier at random.

Not on every compilation if you're ever going to uphold no. 4 above.

IMO we should take the same approach Swift evidently does: depend on unique module names at the base level, hash the module name to get a module identifier, and implement the module renaming strategy described in the Swift proposal, which surely remaps IDs at load time.
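For illustration only, deriving a module identifier from the module name might look like this in Python. SHA-256 and the 20-bit width are assumptions picked for the sketch, and I am not claiming this is how swiftc derives its identifiers.

```python
import hashlib

MODULE_ID_BITS = 20  # hypothetical width of the module identifier


def module_id(name: str) -> int:
    """Derives an identifier from the module name, stable across runs and machines."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Keep the top MODULE_ID_BITS bits of the first 8 bytes of the digest.
    return int.from_bytes(digest[:8], "big") >> (64 - MODULE_ID_BITS)


hylo = module_id("Hylo")
```

Unlike a random pick, the result is identical on every compilation and on every machine, so separately compiled artifacts agree on the identifier of a shared dependency without any coordination. Python's built-in `hash` would not work here because string hashes are salted per process. Distinct names can still hash to the same value, so collision detection or remapping at load time remains necessary.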

we can't avoid recompilation if sources change and I'm not even sure we should.

? Then we have no resiliency whatsoever.

If we add a new function in the API without breaking ABI, it is still possible that recompiling the whole thing would actually produce different behavior.

Yes, thanks to overloading. But it is possible to evaluate an API change and know whether that's possible. It's also possible to build a tool that checks to see if any calls would be changed by recompilation.

That's as far as I went, not having studied what Swift does. But before I do, I think we should establish all the scenarii of interest and the guarantees that we want to offer in each of them.

Personally, I think we should start with the scenarii that Swift covers and ask ourselves if anything is missing.
