RFC: AST node IDs #5689
Replies: 9 comments 17 replies
-
Previous comment from @TzviPM
|
Beta Was this translation helpful? Give feedback.
-
Regarding the parser improvements, a concern has been raised in Discord regarding binary size increases. |
Beta Was this translation helpful? Give feedback.
-
@Boshen, @rzvxa and I had a meeting earlier today. We reached consensus between the 3 of us to follow the plan discussed above at least up to end of Phase 1. I've tweaked the text above to reflect a couple of conclusions we came to in that meet ( We plan to start implementing in a few days time. But hope others who may have comments have time to give feedback in next couple of days. |
Beta Was this translation helpful? Give feedback.
This comment has been hidden.
This comment has been hidden.
-
Please disregard my previous comment! We need both of them, I wasn't thinking clearly. However, Maybe we should implement this second ast builder as part of the |
Beta Was this translation helpful? Give feedback.
-
Prior art: rustc uses pointers as node identifiers and they are using them as keys in HashMap with FxHasher. |
Beta Was this translation helpful? Give feedback.
-
Does it make sense to add the id generator to
|
Beta Was this translation helpful? Give feedback.
-
In my opinion, we should make the Only This will prevent user from creating arbitrary |
Beta Was this translation helpful? Give feedback.
-
The (draft) PR for node counting in the parser has been submitted: #6083 |
Beta Was this translation helpful? Give feedback.
-
We have decided to add
NodeId
s to AST node types. Here are my suggestions on how we should approach this.Motivating use cases
walk_statement
inwalk_statements
#4767).I feel storing data out of band is the strongest motivation. There are viable alternatives for the other 3:
Traverse
does).Blocking issues:
InputOptions.inject
rolldown/rolldown#1578constructor-super
. #3780Related discussions:
Previous attempts:
ast_node_id
field to all ast nodes #2818AstNode
trait and provide access toAstNodeId
#2861Proposed design
What kind of IDs?
I suggest that we have a single node ID for all nodes, rather than separate IDs (
StatementId
,ExpressionId
etc). This aligns with the existingAstNodeId
.We may want in future to split into multiple IDs (
StatementId
etc). But for first implementation, a single ID is simpler.What to call it?
NodeId
.AstNodeId
seems overly verbose. What other kinds of nodes do we have?AstNodeId
. refactor(semantic): s/AstNodeId/NodeId #5740What gets an ID?
Only structs should have an ID. Enums e.g.
Statement
should not itself have an ID. But all the enum's variant "payloads" (e.g.BlockStatement
) will have an ID. Rationale:AstNodeId
.Statement
orExpression
to have its own ID.GetNodeId
implemented onStatement
etc to get its "payload" type's ID.In my opinion, best to not add IDs to enums initially, and we'll discover if that blocks any use cases, and if the workarounds to support those use cases are painful/expensive.
ID type
Currently our IDs are wrappers around a
NonMaxU32
. That's motivated by makingOption<Id>
4 bytes (vsOption<u32>
which is 8 bytes).This imposes some costs:
id = internal_id ^ u32::MAX
), but in some cases may have further knock-on effects. A small overhead, but it's a very hot path e.g. in linter.Some
, but constantly have to callunwrap
onOption<Id>
s. This complicates our code, and adds panicking branches to a lot of code.nonmax
crate.I propose an alternative:
0
as a sentinel value equivalent toOption::<NodeId>::None
.Option<NodeId>
in places where extra bytes are undesirable, we can get a niche inNodeId
with a differentNonMaxU32
implementation.(Ideally we would also change
ScopeId
,SymbolId
andReferenceId
to be stored as plainu32
s, instead ofOption<NonMaxU32>
s, for same reasons as above)When to create IDs
This is the tricky part.
Ideally we'd create IDs in parser and set the
node_id
field on AST nodes when they are created. This:node_id
field twice (dummy value first, then real value later).node_id
fields can be a plainNodeId
, notCell<NodeId>
.However, this is not currently feasible. We'll need to generate
NodeId
s inSemanticBuilder
initially.Why not in parser?
1. Order of IDs
Currently
AstNodeId
s are in traversal order.Program
is node ID 0. So iterating overAstNodes::nodes
gives you nodes in traversal order.But the parser creates nodes in a different order.
Program
is the last node to be created, so would have the highestNodeId
.Note: The parser does not create nodes in reverse travesal order either. Numbered in order of node creation:
When traversing this AST, node IDs in visitation order are: 5, 3, 1, 2, 4.
It is unclear if any linter rules rely on current iteration order of
AstNodes::nodes
or not. If they do, they can be refactored so they don't, so this isn't a deal-breaker, but we'd need to do that first.Don thinks that visiting nodes in parse order rather than traversal order might actually turn out to be a perf gain, as nodes would be visited in same order that they're layed out in arena - perfect access pattern for CPU caches.
2. Unused node IDs
This is the hard problem.
In some places (e.g. when parsing what may or may not be an arrow function) the parser creates nodes, but then may rewind and discard them, and then re-parse that section of code again. If parser creates a
NodeId
for every AST node created (e.g. inAstBuilder
), this would result in gaps in the sequence ofNodeId
s which appear in the AST.Why is this a problem?
Later on
SemanticBuilder
will readNodeId
s from the AST and insert data into side arrays indexed byNodeId
(AstNodes::nodes
,AstNodes::parent_ids
).If we cannot guarantee that there are no gaps in the sequence of
NodeId
s, then these side arrays are likely to end up containing sections of uninitialized bytes. There is nothing to stop user creating aNodeId
from an arbitraryu32
and then reading fromAstNodes::parent_ids
orAstNodes::nodes
with thatNodeId
. If thatNodeId
hits an uninitialized "gap", then that's instant UB. Bounds checks cannot prevent this, as gaps will be in middle of the sequence, so theNodeId
can be in bounds, but still read uninitialized bytes.How can we solve this?
SemanticBuilder
which IDs have been seen, and where gaps are. At the end ofSemanticBuilder
's traversal, fill any gaps with zeros, so no uninitialized gaps remain. Unfortunately, because IDs will be encountered out of order, keeping track of what ranges of IDs have been seen, and where gaps are is non-trivial, and probably expensive.bool
s to represent "node exists" or "node does not exist" for eachNodeId
. Initialize it with zeros, and set each entry totrue
when thatNodeId
is found. This array is onlybool
s so will not take as much memory as the main arrays, and so is cheaper to set up. But downside is it imposes a cost on every read from the arrays - now every lookup intonodes
orparent_ids
needs to first check thenode_exists
array, to make sure the node exists before reading fromnodes
.All of these are possible, but all come with a significant performance penalty.
I can imagine these replies:
NodeId
s and tests will catch any bugs." I don't think that's a good idea. Undefined behavior is not a normal bug. It can cause strange behavior in unrelated and unexpected places. It may even cause tests to pass when they shouldn't, and UB is extremely hard to debug. e.g. inserting adbg!()
can magically make the problem you're trying to track down disappear. We've seen exacty this problem in transformer due toAstBuilder::copy
.NodeId
sequence". We can try to do that, but can we statically prove it's impossible? I don't think we can, in which case we open the door to UB.NodeId
s, so anyNodeId
that exists is always valid". Yes, but nothing prevents using aNodeId
from one AST to index into the arrays relating to another AST. UB!There is one possibility which could work, but is quite a sizeable undertaking (see below).
3. Cost of growing
Vec
sLet's say we are creating
NodeId
s in parser, and also building the side arrays e.g.parent_ids
in parser too.How many nodes are there going to be? It's impossible to know at the outset. #4367 demonstrated that growing large
Vec
s because you didn't allocate enough capacity initially is very costly. And for large source files, the number of AST nodes can be huge (100,000+) so theseVec
s are big.Maybe we can just allocate capacity initially to the number of bytes in source text + 2 (I think that's the maximum number of nodes you can get per byte of source e.g.
0+0+0
= 5 bytes, 7 nodes). That will be way too big for normal JS code, but it guarantees no reallocations. Then shrink the arrays to fit after parsing is complete, to free the excess memory again.We would need to check:
So maybe this isn't a deal-breaker, but there are potential challenges.
What to do
I suggest that we tackle this in 3 phases:
Phase 1: Add
NodeId
snode_id
fieldsAdd
node_id: Cell<NodeId>
fields to all AST types.In my view, the fields should be visible in the type defs and not "hidden away" by adding them in a macro. So we'd need to add them manually. But we could do it in
#[ast]
macro initially while we're testing this out.node_id
would ideally be the first field on each type (beforespan
).GetNodeId
traitSimilar to
GetSpan
. We can#[generate_derive(GetNodeId)]
.If we make
node_id
the first field on each type, because the AST is now#[repr(C)]
, the memory offset of the field is guaranteed the same for every type. This makesGetNodeId::node_id
extremely cheap even for the massive enums likeExpression
, as compiler sees it needs no branches to fetch the ID: https://godbolt.org/z/PGfWx6T572nd
AstBuilder
for transformerCreate a 2nd
AstBuilder
for use in transformer + minifier (AstNodeBuilder
?).NodeId
.node_id
on new nodes.Copy
, as needs to be used as&mut AstNodeBuilder
.TraverseCtx
, so only a single instance exists - noRefCell
s or unsafe required.If we don't pass the "parser" version of
AstBuilder
to any transformers, then they have to create nodes viaAstNodeBuilder
, making it impossible to accidentally create any AST nodes without an ID.AstNodeBuilder
can also help us get other things right in the transformer. It can enforce that:IdentiferReference
s must have a validReferenceId
.BindingIdentifier
s must have a validSymbolId
.scope_id
field must be created with a validScopeId
.Make all AST node creation happen via an
AstBuilder
#[non_exhaustive]
. This means they cannot be constructed by code outside of theoxc_ast
crate.oxc_ast
which construct AST nodes except inAstBuilder
(e.g.ForStatement::new
).IdentifierReference {span, name, reference_id}
) withAstBuilder
calls.If we did not create the 2nd
AstBuilder
for transformer yet, we'll need to add methods to currentAstBuilder
for the transformer to use:IdentifierReference
with areference_id
.BindingIdentifier
with asymbol_id
.scope_id
.Without these methods, we won't be able to transition all the transformer over to creating all AST nodes via
AstBuilder
.Other IDs
I think we should leave the other ID fields (
scope_id
,symbol_id
,reference_id
) present in AST, as they are now. @rzvxa suggested accessing them via side tables indexed byNodeId
. I don't think this is ideal because:ScopeId
etc via a side table adds an extra indirection.SymbolId
.ScopeTree
+SymbolTable
, notNodes
.Phase 2: Optimize
Type layouts
Because
NodeId
is au32
, the field order of types will not be optimal, and in most types it will have 4 bytes padding after it. So every AST type will grow by 8 bytes.To fix this, optimize field order in
oxc_ast_tools
, filling the 4-byte gap afterNodeId
with otheru32
/bool
/ single-byte enum fields. As many types currently have 4 or more bytes padding, rearranging the fields will win back that 8 bytes in most cases, and addingNodeId
will be free for many/most types.AstNodes
SoAConvert
AstNodes
into full struct-of-arrays (splitAstNode
up).AstNode
will no longer be required (though we may need to keep it around until we've transitioned the linter to stop using it).Access methods on
NodeId
Rather than getting info about a node via methods on
Semantic
, I propose that we implement methods onNodeId
itself.Along the same lines as: https://github.com/oxc-project/backlog/issues/99
I feel this is cleaner and easier to read. It also avoids the huge pile-up of verbosely-named methods on
Semantic
.Count nodes, scopes, symbols, references in parser
SemanticBuilder::build
traverses the entire AST to count nodes, scopes, symbols and references. These counts are used to allocate sufficient capacity inNodes
,ScopeTree
andSymbolTable
to avoid thoseVec
s growing during the main traversal.oxc/crates/oxc_semantic/src/builder.rs
Lines 223 to 238 in 5482e73
Move this logic into parser. @DavidHancu mentioned on Discord that he has a working implementation, but he's not submitted a PR yet.
Probably we could implement the counters in
AstBuilder
.The tricky part will be getting an accurate count when the parser rewinds - currently counts will be too high because some nodes are created and then discarded again.
Or possibly we could just allocate too much capacity (source text length + 2), and then shrink the arrays to fit once we know the real number (see "3. Cost of growing
Vec
s" section above). An investigation into this option: #5703Phase 3: Move generating
NodeId
s into parserTODO: I'll write this up later.
Beta Was this translation helpful? Give feedback.
All reactions