Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

codegen: don't use PSym for locals #914

Merged
merged 16 commits into from
Sep 22, 2023

Conversation

zerbina
Copy link
Collaborator

@zerbina zerbina commented Sep 17, 2023

Summary

Use a dedicated intermediate representation for local var/let,
parameters, and the result variable in the code generator IR.

This is the first step towards removing all PSym usage from the
backend/code-generators. The goal is to further decouple sem from the
code generators, making it possible evolve both (especially their IRs)
more independently from each other.

Details

General design

  • cgir.Local stores all relevant information about a local
  • the information about all locals is owned by and stored as part of
    Body
  • a local is referenced from the CgNode IR via the cnkLocal node,
    which stores a LocalId value
  • the result variable and parameters are also treated as locals.
    In most cases, they don't require any extra handling, so them having
    their own data types and node kinds would only complicate things
  • for simplicity, for the Body of a procedure, the first local slot is
    always reserved for the result variable, even if the procedure has no
    return type. At least for now, this approach makes it easier to
    ad-hoc compute the IDs for parameters from just their position, which
    is still needed in some cases

In terms of data representation, a Local stores the type, alignment,
flags (currently still TSymFlags), and the name (which may be unset).
Unlike the MIR, no special distinction is made between temporaries and
user-defined locals.

In addition, Local also stores whether the corresponding user-defined
local is immutable (the isImmutable field). This additional
information is meant to help the code generators optimize: for example,
the JavaScript code generator doesn't use indirection for let locals,
even if they have their address taken.

For extending the data about locals, each code generator uses an
OrdinalSeq[LocalId, T] where T is a data type tailored to the needs
of the respective code generator. This data-oriented design allows for
easy extension without having to modify the base IR itself.

Overview of the changes

All places where cnkSym nodes were previously handled are adjusted to
also handle cnkLocal nodes. This usually means moving the local/
parameter/result specific handling to a new cnkLocal branch.

Merging two bodies (cgir.merge) also has to take locals into account:
the list with the locals from the source body is appended to that of the
destination body, and the cnkLocal node in the merged source body are
updated.

The code generators (cgen and jsgen) that support incremental code-
generation for procedures need to synchronize their OrdinalSeq with
the body's Store, after a partial procedure was updated. For this
purpose, the relevant code generators are extended with the genPartial
procedure.

Changes to CgNode rendering

  • in order to properly render locals, cgirutils.render now requires a
    Body instead of only a CgNode
  • the name disambiguation logic is generalized to support disambiguating
    the names of PSym- and Local-based symbols (through use of the new
    internal Symbol type)

In the output, compiler-inserted locals are now rendered as
:aux_$LocalId (auxiliary local) instead of :tmp_$id. The
--expandArc-using tests are adjusted accordingly.

Initialization of Locals

The setup of the Local data for a Body happens in cgirgen:

  • locals reach cgirgen as PSyms, meaning that they need to be
    translated. When appearing in 'def's, the symbol of a local is
    translated to a Local, a LocalId is allocated, and a mapping
    between the PSym and LocalId is established (via the localsMap)
  • for procedures, generateIR is responsible for setting up the
    Locals representing the result variable and parameters
  • special handling is required for except T as e, where T is an
    imported exception type: these also act as definitions of new locals

Creating or translating temporaries no longer requires creating new
PSyms, which also means that there's one less source of IdGenerator
interaction in cgirgen. The IdGenerators should ideally be sealed at
the end of semantic analysis, and this is a step towardst this goal.

C code-generator changes

  • instead of in a SymbolMap, the TLoc for locals is stored in
    an OrdinalSeq
  • the special handling to account for definitions of locals being
    visited multiple times for .inline procedures becomes obsolete and
    is removed
  • cnkLocal nodes cannot appear in construction expressions, and
    they thus don't need special handling in hashTree
  • testing whether a local location is uninitialized no longer relies on
    TSym.locId; inspect the TLoc.k value is enough (locEmpty
    signals an empty/uninitialized location)
  • for detecting parameters, reifiedOpenArray now requires access to
    a BProc instance
  • for detecting parameters, the mapTypeChooser template accepting a
    CgNode also requires access to a BProc instance
  • the dispatching based on the specific symbol kind (skParam,
    skTemp, skResult, etc.) becomes obsolete and is removed
  • the parameter and result variable setup in cgen.startProc is
    adjusted
  • the separate handling of parameters when writing NDI files is
    removed; parameters are part of the locals now
  • checking for skTemp symbols when writing NDI files is also not
    necessary anymore: it is replaced with checking whether the local has
    both a user-provided and mangled name.

Name mangling for locals with user-provided names stays the same, but
for locals without user-defined names, an underscore (_) plus the
stringified integer value of the LocalId is now used. Within a
single procedure, this simple mangling guarantees unique identifiers.

Because parameters cannot easily be distinguished from other locals,
compat.isLvalue now always treats them as l-values. This only affects
the code generated (without impacting observable behaviour) for object
up- and down-conversion.

JS code-generator changes

  • the new Loc type represents the code-generator's state for a local.
    It is made up of the mangled name and the location's storage
    description
  • in order to free up the name, the locals field in TProc is renamed
    to defs
  • Loc is intended to be eventually used for globals and constants too
  • computing the storage mode once (StorageFlags) and then storing the
    result in Loc is simpler than storing the symbol flags and kind and
    recomputing the storage mode on each query
  • Locs for locals are initialized (via setupLocalLoc) when a
    definition statement is encountered: these are currently 'def' nodes
    and 'except' branches for imported exceptions
  • since storage relies on this information, a symbol kind has to be
    passed to setupLocalLoc
  • genSymAddr is restructured into the generic addrLoc, which works
    with both Loc and PSym
  • the core part of genSym is extracted into the generic accessLoc,
    which works with both Loc and PSym
  • the interface of genVarInit is adjusted to work for both Loc and
    PSym

Once Loc is used for all location in the JS code generator, the
compatibility and adapter routines can be removed.

VM code-generator changes

  • the new LocalLoc type represents the code-generator's state for a
    local. The type replaces the symbol-id-to-register table (locals)
    and the addressTaken set
  • parameter setup (genParams) becomes simpler: no special handling
    for the result variable and environment parameter is needed anymore
  • genSymAsgn is split into genAsgnToLocal and genAsgnToGlobal
  • to reduce the amount of required changes, genSym and genSymAddr
    still support locals for now. This should be cleaned up separately

This frees up the name `locals` to be used for the storage of local-
related data.
`Local` represents static information about a local variable. `Body`
stores a list of them, and `cnkLocal` references an element in the list
via `LocalId`.
* `treeRepr` now supports `cnkLocal`
* to be able to render references-to-local, `render` now requires a
  `Body` as input
* generalize the existing symbol rendering to also work for non-`PSym`
  symbols
* locals reach `cgirgen` as `PSym`s, meaning that they need to be
  translated. When appearing in 'def's, the symbol of a local is
  translated to a `Local`, a `LocalId` is allocated, and a mapping
  between the `PSym` and `LocalId` is established (via the `localsMap`)
* for procedures, `generateIR` is responsible for setting up the
  `Local`s representing the result variable and parameters
* special handling is required for `except T as e`, where `T` is an
  imported exception type: these also act as definitions of new locals

Creating or translating temporaries no longer requires creating new
`PSym`s, which also means that there's one less source of `IdGenerator`
interaction.
Because parameters cannot easily be distinguished from other locals,
`compat.isLvalue` always treats them l-values. This only affects the
code generated (without impacting observable behaviour) for object up-
and down-conversion when using the C backend.
The procedure implements synchronizing the number of elements between an
`OrdinalSeq` and a `Store`, which is something that the code generators
are going to need to do.
* instead of having to use a symbol-id-to-register table (`locals`) plus
  an additional set (`addressTaken`) for marking locals as requiring an
  indirection, a single `OrdinalSeq` indexed by the `LocalId` is
  sufficient now. The `OrdinalSeq` stores the current state (`LocalLoc`)
  for each local in the processed fragment
* parameter setup (`genParams`) becomes simpler: no special handling
  for the result variable and environment parameter is needed anymore
* `genSymAsgn` is split into `genAsgnToLocal` and `genAsgnToGlobal`
* to reduce the amount of required changes, `genSym` and `genSymAddr`
  still support locals for now. This should be cleaned up separately
* the new `Loc` type represents the code-generator state of a local. It
  is made up of the mangled name and the storage description
* `Loc` is intended to be eventually used for globals and constants too
* computing the storage mode once (`StorageFlags`) and then storing the
  it in `Loc` is simpler than storing the symbol flags and recomputing
  the storage mode on each query
* `Loc`s for locals are initialized (via `setupLocalLoc`) when a
  *definition* statement is encountered: these are currently 'def' nodes
  and 'except' branches for imported exceptions
* since `storage` relies on this information, a symbol kind has to be
  passed to `setupLocalLoc`
* `genSymAddr` is restructured into the generic `addrLoc` that works
  with both `Loc` and `PSym`
* the core part of `genSym` is extracted into the generic `accessLoc`
  that works with both `Loc` and `PSym`
* the interface of `genVarInit` is adjusted to work for both `Loc` and
  `PSym`

Once `Loc` is used for all location in the JS code generator, the
compatibility and adapter routines can be removed.
* instead of in a `SymbolMap`, the `TLoc` for locals is stored in
  an `OrdinalSeq`
* the special handling to account for locals being defined multiple
  times for `.inline` procedures becomes obsolete and is removed
* `cnkLocal` nodes cannot appear in the construction expression, and
  they thus don't need special handling in `hashTree`
* testing whether a local location is uninitialized no longer relies on
  `TSym.locId`, but can just inspect the `TLoc.k`
* for detecting parameters, `reifiedOpenArray` now requires access to
  a `BProc` instance
* for detecting parameters, the `mapTypeChooser` template accepting a
  `CgNode` also requires access to a `BProc` instance
* the dispatching based on the specific symbol kind (`skParam`,
  `skTemp`, `skResult`, etc.) becomes obsolete and is removed
* the parameter and result variable setup in `cgen` is adjusted
* the separate handling of parameters when writing NDI files is
  removed; parameters are part of the locals now
* checking for `skTemp` symbols when writing NDI files is also not
  necessary anymore: it is replaced with checking whether the local has
  both a user-provided *and* mangled name.
When incrementally generating the target-language code for a procedure,
the respective code generator's `OrdinalSeq` storing the state of the
locals needs to be synchronized with the data from `Body`.

The new `genPartial` procedure added to both the C and JS code
generators takes care of this; the VM code generator doesn't use
incremental code generation for procedures and thus doesn't need a
`genPartial` procedure.
Temporaries are now rendered differently.
@zerbina zerbina added refactor Implementation refactor compiler/backend Related to backend system of the compiler labels Sep 17, 2023
@zerbina zerbina added this to the MIR phase milestone Sep 17, 2023
@zerbina zerbina requested a review from saem September 17, 2023 22:06
@mrgaturus
Copy link
Contributor

In the output, compiler-inserted temporaries are now rendered as
:local_$LocalId instead of :tmp$id. The --expandArc-using tests
are adjusted accordingly.

i would prefer :tmp_$LocalId because :tmp has sightly better readability than :local when there are many locals and also may be more specific because those locals are temporal helpers to defined locals before arc expand

i like the idea of _$LocalID postfix because is useful to locate between all expanded sections

@zerbina
Copy link
Collaborator Author

zerbina commented Sep 18, 2023

i would prefer :tmp_$LocalId because :tmp has sightly better readability than :local when there are many locals and also may be more specific because those locals are temporal helpers to defined locals before arc expand

I'm not set on using the :local_ prefix, but :tmp would, in my opinion, be wrong. A local without a user-provided name is not necessarily a temporary variable. :tmp being used previously was more of a misnomer.

For better readability, only using _$LocalId, like how cgen renders them, could be a good alternative.

Copy link
Collaborator

@saem saem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor items that you might want to change; otherwise good to go. Decoupling is going to be a big win.

Admittedly, I did struggle thinking through the broader logical changes, but that's 100% my struggle with focusing these past few days.

compiler/backend/ccgcalls.nim Outdated Show resolved Hide resolved
compiler/backend/cgirgen.nim Outdated Show resolved Hide resolved
@@ -166,6 +166,12 @@ func newSeq*[I; T](x: var OrdinalSeq[I, T], len: int) {.inline.} =
func setLen*[I; T](x: var OrdinalSeq[I, T], len: int) {.inline.} =
setLen(base(x), len)

func synchronize*[I; A; B](x: var OrdinalSeq[I, A], s: Store[I, B]) =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func synchronize*[I; A; B](x: var OrdinalSeq[I, A], s: Store[I, B]) =
func rightSize*[I; A; B](x: var OrdinalSeq[I, A], s: Store[I, B]) =

I think this is slightly more indicative of the change in size and its destructive nature.

Copy link
Collaborator Author

@zerbina zerbina Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe I'm misunderstanding, but I don't think it's a destructive operation. s must be larger than or equal in size to x, so no existing elements are destroyed/cut off.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jeez, i seriously misread it.

rightSize might still fit, but up to you.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll stay with synchronize for now. To me, rightSize sounds like a query rather than a modification.

@mrgaturus
Copy link
Contributor

For better readability, only using _$LocalId, like how cgen renders them, could be a good alternative.

what about :aux_$LocalId?, i think "auxiliar" is a better meaning about what those locals do

@zerbina
Copy link
Collaborator Author

zerbina commented Sep 21, 2023

what about :aux_$LocalId?, i think "auxiliar" is a better meaning about what those locals do

I like it!

@zerbina
Copy link
Collaborator Author

zerbina commented Sep 22, 2023

/merge

Thank you for the review, @seam, and thank you for the aux suggestion, @mrgaturus.

@github-actions
Copy link

Merge requested by: @zerbina

Contents after the first section break of the PR description has been removed and preserved below:


Notes for Reviewers

@chore-runner chore-runner bot added this pull request to the merge queue Sep 22, 2023
Merged via the queue into nim-works:devel with commit 54d372c Sep 22, 2023
18 checks passed
@zerbina zerbina deleted the codegen-dont-use-psym-for-locals branch September 29, 2023 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/backend Related to backend system of the compiler refactor Implementation refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants