Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal: separate TLoc from TSym and TType #790

Merged
merged 8 commits into from
Jul 9, 2023

Conversation

zerbina
Copy link
Collaborator

@zerbina zerbina commented Jul 8, 2023

Summary

Remove the loc fields from both TSym and TType and leave it to
the code generators how and if they use TLoc -- TSym only stores the
external name plus interface flags now. The benefits:

  • another point of coupling (through data-types) between semantic
    analysis and code generation is removed
  • the size of an TType instance shrinks from 120 to 96 bytes (with a
    64-bit target)
  • the size of an TSym instance shrinks from 168 to 160 bytes

Because of its lode field, TLoc not being defined in the ast_types
module is also an important prerequisite for introducing a code IR for
the code generators.

Details

The key changes:

  • move TLoc and the associated types only relevant to the C code
    generator to cgendata
  • remove the loc field from TSym and TType -- TSym stores the
    external name and interface flags via the new extname and locFlags
    fields
  • as an interim solution, add the locId field to TSym
  • fix the name slot in the ast of dispatcher storing the wrong symbol
  • NDI file generation is now the responsibility of the orchestrator

cgen

For associating a PSym with a TLoc, multiple Stores shared across
all C modules are used. The TSym.locId is an index into the
SymbolMap (which is a Store underneath) for the respective symbol
kind (globals, constants, procedures, fields). While a single
SymbolMap could be used (for the non-procedure entities), the plan is
to use dedicated types for the entities in the future, and using
separate stores is a preparation for that.

In order to not re-mangle parameters every time when emitting a
prototype, the locs for parameters are stored with their procedure.

Since locals are known to not outlive their surrounding procedure, the
locs for them are stored in a dedicated SymbolMap with the procedure
context -- once a procedure context goes away, the locals' TLocs can
be, and are, freed already.

In general, the C code generator is now more strict with who is
responsible for filling-in loc information: for constants and globals,
it must only happen once during definition. Procedures and fields still
need to support ad-hoc loc setup, and fields need special-casing because
of .inline procedures and how finally clauses work (i.e., they're
duplicated at a too late time).

A table-based approach where the symbol IDs are mapped to TLocs would
have also worked, but measurements showed that it was ~1.4% slower.

NDI files

Instead of directly updating the NDI file with a new entry when
generating a mangled name, as was done previously, adding the entries is
now the responsibility of the orchestrator: entries for whole-program
entities are added after code generation has finished -- entries for
locals once code generation for a procedure has finished.

For NDI file generation to work, the symbol. This was already the case
for all entities except constants, so constants now also store a nkSym
node in their lode field.

jsgen

Due to it being less complex than the C code generator, a table-based
approach is okay for the JavaScript code generator. Only the JavaScript
name is relevant to the code generator, so it maps the symbol IDs
directly to strings and doesn't use TLoc. Separate tables are used
for global entities (globals, constants, procedures, and fields),
locals, and parameters.

Store all locs associated with symbols in lookup tables, with `TSym.loc`
only being queried (but never written) for the external name or user-
provided flags.

In order to efficiently associate a `TSym` with a code-generator `TLoc`,
each `TSym` stores a one-based index (`locId`) into one of the code
generator's lookup tables. A table-based approach would have too, but
it's slightly slower (~1.4% when building the compiler itselfs).

In general, `cgen` becomes more strict with locs, now requiring globals,
constants, and parameters to have their loc set-up once at definition
processing time. Procedures, fields, and locals still need to
support ad-hoc loc setup.

Other changes:
- writing the mangled names to NDI files is temporarily disabled
- in symbol definitions contexts, `loc.t` is replaced with the querying
  the symbol's type directly (both are equivalent)
- locs for constants now store a symbol node in their `lode` field
Use a `Table` to map symbols to the corresponding JavaScript name,
which removes all modifications of `TSym.loc` from `jsgen`.

Due the lesser complexity of the JavaScript code generator compared to
the C code generator, a table-based approach works well enough.
`TType` doesn't need a `TLoc` field at all, and `TSym` only needs to
store the external name (`extname`) plus the interface flags
(`locFlags`).

The `TLoc`, `TLocKind`, and `TStorageLoc` types are moved to `cgendata`,
with `TLocFlag` staying in the `ast` module for now.
The orchestrator is now responsible for writing the mangled names to the
NDI files.
This is made possible by them no longer requiring access to an NDI
file.
It stored in the name slot the symbol of the method that the dispatcher
was derived from, which now causes problems. With the fix, for all
routines reaching the backend, `sym.ast[namePos] == sym` *should* be
true.
@zerbina zerbina added refactor Implementation refactor compiler/sem Related to semantic-analysis system of the compiler compiler/backend Related to backend system of the compiler labels Jul 8, 2023
@zerbina zerbina requested a review from saem July 8, 2023 21:46
Copy link
Collaborator

@saem saem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to having this merged in, further breaking dependencies like this is great.

Few minor items, they're mostly typographical and non-blocking.

@@ -355,7 +349,8 @@ proc copySym*(s: PSym; id: ItemId): PSym =
result.magic = s.magic
result.options = s.options
result.position = s.position
result.loc = s.loc
result.extname = s.extname
result.locFlags = s.locFlags
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too big a change to name it extFlags?

a soft suggestion as a fair bit of this is under heavy transition.

TLocFlag* = enum
# XXX: `TLocFlag` conflates two things:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, even more reason to hold off on extFlags until these have been separated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm going to split the enum up immediately after this PR.

## generated name is to be used
locFlags*: TLocFlags ## additional flags that are relevant to code
## generation
locId*: uint32 ## associates the symbol with a loc in the C code
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If two symbols can't be associated with the same location, could we piggy back off of itemId?

(I'm guessing this will result in a 'no' once I've read further in the PR)

It would mean we'd have to store something for each symbol (more memory), but that might balance out as we're storing an extra 4 bytes with the current change. The lack of loc information would have to be tracked on the loc data end, which does mean a deref/index-lookup prior to knowing it doesn't exist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revisiting this comment, I'm pretty sure this wouldn't be a good approach at present, better to defer it as the necessary data and consequent layout are likely to change dramatically.

@@ -93,6 +94,14 @@ type

func hash(x: PSym): int = hash(x.id)

proc writeMangledLocals(p: BProc) =
## Writes the mangled names of `p`'s locals to the module's NDI file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice secondary benefit developing over time, as the mangling becomes more and more consistent it's going to make gdb pretty printers' life all that much easier.

compiler/backend/ccgstmts.nim Outdated Show resolved Hide resolved
compiler/backend/ccgtypes.nim Outdated Show resolved Hide resolved
compiler/backend/jsgen.nim Outdated Show resolved Hide resolved
Co-authored-by: Saem Ghani <saemghani+github@gmail.com>
@saem
Copy link
Collaborator

saem commented Jul 9, 2023

/merge

@github-actions
Copy link

github-actions bot commented Jul 9, 2023

Merge requested by: @saem

Contents after the first section break of the PR description has been removed and preserved below:


Notes for Reviewers

@chore-runner chore-runner bot added this pull request to the merge queue Jul 9, 2023
Merged via the queue into nim-works:devel with commit bb4225d Jul 9, 2023
@zerbina zerbina deleted the split-loc-from-symbol branch July 9, 2023 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/backend Related to backend system of the compiler compiler/sem Related to semantic-analysis system of the compiler refactor Implementation refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants