Metadata should not depend on (absolute) text spans #11304

4e6 · 2024-10-11T08:14:36Z

User Visible Goal

Let the user "edit source files in an external editor" without completely breaking the information persisted in the METADATA section, e.g. without losing positions, colors, etc. of nodes in the graph, the state of widgets, etc.

Terminology

The META-DATA section consists of two lines, plus on-the-fly mappings that are part of TextEdit requests:

the first line defines UUIDs and their source locations - together with the on-the-fly mappings it is called the IdMap
the second line (nicknamed the "IDE line") contains meta-data (position, color, state, etc.) associated with the UUIDs defined in the first line
since Remove expression UUIDs from metadata section of a source file #10182 there are also on-the-fly mappings transferred as part of TextEdit requests - together with the first line of the META-DATA section they form the so-called IdMap

UUIDs are important for the language server protocol. The IDE and the engine communicate via the language server protocol and use UUIDs to identify "locations".

Since #10182 the first META-DATA line contains only the UUIDs that appear on the second line, i.e. those used by the IDE to associate some persistent information with those locations. The rest of the UUIDs are generated on the fly - each TextEdit request can contain on-the-fly additions to the first META-DATA line in the file, forming a new IdMap.

Constraints

as it is hard to estimate what impact changes to the UUID system would have on the language server protocol - keep UUIDs as they are now
as it is easy to see what impact a change of the format of the META-DATA section has - redesign the META-DATA section format
as it is possible to design a META-DATA format that is resilient to user changes - just redesign the META-DATA section format so that it does not depend on absolute text spans
don't use any 3rd party (Y.js) persistence format
whatever happens, don't execute the old version of the code, only the latest one
metadata section should be a comment

That way we can satisfy the user goal without impacting the whole system and keep the change confined to the META-DATA section format. Two alternative solutions to this problem (classical and snapshot) are described in the next sections.

Snapshot: Persist Code Twice (obfuscated by base64)

Explained by @kaz at #11304 (comment)

Classical: AST Based Anchor & Local Text Spans

The metadata stored in files (by the IDE) currently only relates to nodes, e.g. to var = expr statements inside of method bodies. Let's use the AST path to such an element as an anchor to identify a semantic location in the source code, i.e. each IDE node can be identified using its path in the AST. Let's use the following format for the anchor: {method pointer}.{variable name}. For example, in

main =
    op1 = expr1

the op1 node can be identified as main.op1.
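
A minimal sketch of how such an anchor could be looked up, assuming a heavily simplified module model. The `Binding` and `Module` shapes and the `resolveAnchor` function are hypothetical names used only for illustration; they are not part of the Enso codebase.

```typescript
// Hypothetical, simplified model: a module maps method names to the bindings
// in their bodies; each binding records where its statement starts.
interface Binding {
  name: string
  start: number // absolute offset of the statement in the file
}

type Module = Map<string, Binding[]>

// Resolve an anchor such as "main.op1" to the binding it names, if any.
// The last dot separates the method pointer from the variable name, so a
// method pointer that itself contains dots is still handled.
function resolveAnchor(module: Module, anchor: string): Binding | undefined {
  const dot = anchor.lastIndexOf('.')
  if (dot < 0) return undefined
  const method = anchor.slice(0, dot)
  const variable = anchor.slice(dot + 1)
  return module.get(method)?.find((b) => b.name === variable)
}
```

An anchor that no longer resolves (a renamed or removed method or variable) yields `undefined`, which corresponds to the cases where the metadata for that node would be lost.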

Local Text Spans

In addition to the above format of AST-based anchor identification, we need a way to specify an exact location (just like the current system of absolute text spans does). To do so, let's support exact identification by relative text spans. E.g.:

instead of [Span, UUID] pairs (a mapping from an absolute span, measured from the beginning of the file, to a UUID),
let's use [AST Path, Span, UUID] triples

The AST Path provides an offset-neutral location inside the AST: an anchor resilient to user edits (all but removal or rename of a method or its variable). The local text span makes it possible to fine-tune the location to any expression or element inside the nearest AST anchor.
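
The following sketch shows how a triple could be turned back into an absolute span at load time. It assumes the span is stored as an offset and a length relative to the start of the anchored statement, and it uses a toy line-based lookup in place of a real parser; `Entry`, `anchorOffset` and `toAbsolute` are hypothetical names.

```typescript
// [AST path, relative span, UUID]; the span is relative to the anchor's start.
type Entry = [path: string, span: [start: number, length: number], uuid: string]

// Toy anchor lookup (an assumption for illustration): a method is an unindented
// `name =` line and a binding is an indented `name = expr` line below it.
function anchorOffset(source: string, path: string): number | undefined {
  const dot = path.lastIndexOf('.')
  if (dot < 0) return undefined
  const method = path.slice(0, dot)
  const variable = path.slice(dot + 1)
  let offset = 0 // absolute offset of the current line
  let inMethod = false
  for (const line of source.split('\n')) {
    const body = line.replace(/^\s+/, '')
    const indent = line.length - body.length
    if (indent === 0 && body !== '') inMethod = body.startsWith(`${method} =`)
    else if (inMethod && body.startsWith(`${variable} =`)) return offset + indent
    offset += line.length + 1 // +1 for the newline
  }
  return undefined
}

// Convert an entry to an absolute [start, end) span in the current source.
function toAbsolute(source: string, [path, [start, length]]: Entry): [number, number] | undefined {
  const base = anchorOffset(source, path)
  return base === undefined ? undefined : [base + start, base + start + length]
}
```

Because the span is measured from the anchor, inserting or deleting text before the `main.op1` statement shifts the computed absolute span without invalidating the stored entry.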

Can it Work?

The IDE & language server continue to use UUIDs in their protocol as usual
The new META-DATA section can specify all the locations the current system can specify
the [AST Path, Relative Span, UUID] triple remains stable under all edits not related to the anchor itself or the content up to the next anchor
the new META-DATA section format will be versioned - once it is found insufficient (for example because of defining patterns on the LHS) we design a new format

Yes, it is going to work.

kazcw · 2024-10-11T14:38:52Z

I like the idea of having symbolic source-code references, but there are a lot of syntactic cases that will each need their own solution.

The proposed path type ({method pointer}.{variable name}) can identify a binding, but we have plans to support attaching metadata to any subexpression (e.g. widget picker: #8754). Even today, not every component shown in the graph has a unique binding; a component can be:

a method's return expression
an expression-statement without a binding, other than a return expression
a method argument definition

Another thing to consider is that the LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all, but once that is implemented there won't always be a simple way to stringify the LHS of a binding.

If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

Parse the snapshot to an AST; attach the metadata.
Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

farmaazon · 2024-10-14T11:20:59Z

Another thing to consider is that the LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all, but once that is implemented there won't always be a simple way to stringify the LHS of a binding.

I think this is not much of a problem: the key of a given metadata may be just the entire binding - assuming every binding must introduce a variable, they would have to be unique anyway.

As for subexpressions: I think we could just design a "breadcrumb" identification of widgets inside an existing node, which could be even a bit smarter than AST crumbs.

The only real problem I see here is the "bindingless" nodes - but here we could make our graph require giving them a name when trying to assign any metadata (like position).

If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

Parse the snapshot to an AST; attach the metadata.

Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

How is Ast.syncToCode resilient to reordering lines inside the definition? This is one of the advantages of storing metadata "by binding".

kazcw · 2024-10-14T16:45:37Z

If the goal is storing metadata in a way that it is resilient to external edits, there's a simpler way: We can use the module source code as a map of itself. Include a snapshot of the module alongside the serialized metadata; then to load a module from disk:

Parse the snapshot to an AST; attach the metadata.

Use Ast.syncToCode to update the parse tree to the current source code.

This way we would preserve all metadata, anywhere in the AST.

How is Ast.syncToCode resilient to reordering lines inside the definition? This is one of the advantages of storing metadata "by binding".

Currently it tracks reordered lines, but not lines that are both reordered and mutated:

enso/app/ydoc-shared/src/ast/parse.ts, line 890 in d1ee7fa:

    // Movement matching: For each new tree that hasn't been matched, match it with any identical unmatched old tree.

It would be straightforward to add binding-aware block comparison in order to handle reordered, mutated lines--easier I think than defining an addressing scheme that can identify any of the syntactic constructs we render as components, and any of their subexpression ASTs.
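
To illustrate the movement-matching idea quoted above, here is a toy version that works on lines instead of AST subtrees. It is not the `parse.ts` implementation; `Line`, `matchMoved` and `freshId` are hypothetical names.

```typescript
// Toy illustration only: nodes are lines with an identity; the real
// algorithm operates on AST subtrees.
interface Line {
  id: string
  code: string
}

function matchMoved(oldLines: Line[], newCode: string[], freshId: () => string): Line[] {
  // Pool of unmatched old lines, grouped by identical code.
  const pool = new Map<string, string[]>()
  for (const { id, code } of oldLines) {
    const ids = pool.get(code) ?? []
    ids.push(id)
    pool.set(code, ids)
  }
  // Each new line takes the id of an identical unmatched old line, if any;
  // otherwise it gets a fresh id (an added or mutated line).
  return newCode.map((code) => ({ code, id: pool.get(code)?.shift() ?? freshId() }))
}
```

In this toy version a line that is only reordered keeps its id, while a line that is both moved and changed gets a fresh id, which mirrors the limitation described above.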

farmaazon · 2024-10-15T08:04:37Z

Well, I think it sounds quite good to me. I would only make sure the code snapshot is "encrypted" for the user, so they won't edit the snapshot instead of the code by accident. Something sort of "compress + base64".

JaroslavTulach · 2024-10-16T12:49:32Z

LHS of a binding is not strictly an identifier; it is a pattern. I don't think the backend currently supports destructuring-bindings at all

An essential part of the new meta-data format is the identification of its version. It doesn't matter that the format isn't good enough for future evolution of the language/engine. Once it is found insufficient, we will define a new format and change its version identification.

edit source files externally without (totally) breaking the METADATA section

We are looking for a fast solution that allows users to edit .enso files in an external editor and load the files back into the IDE without total layout reset.

JaroslavTulach · 2024-10-16T15:00:40Z

Parse the snapshot to an AST; attach the metadata.

@kazcw explained to me:

Include a snapshot of the module alongside the serialized metadata

What is a snapshot of a module?

The idea is that the .enso file would include the source code twice: Once in plain text, externally-editable, and once "armored" (compress and base64 or the like). Then we will always have the IDE's last state to compare to the possibly-externally-edited source.

I see. Such a duplication goes against the attempt to make META-DATA section smaller. Making the enormous meta-data smaller was a huge driver behind

Remove expression UUIDs from metadata section of a source file #10182

We want to make sure the meta-data section is even smaller than right now (ideas described in #7989), not doubling the size of the user code.

farmaazon · 2024-10-16T15:14:27Z

I think a compressed snapshot won't take as much.

Also, the doubling of source code is ok for me - code files aren't particularly big after all. And, in files where every node has metadata attached (position, visualization, color...) it will be hard not to double the code size, actually.

The problem we had was not that the metadata section doubled the size, but that it increased it two orders of magnitude.

jdunkerley · 2024-10-16T15:25:44Z

Agree - I'm not worried about making the metadata smaller than it is now. The previous version where it would be multiple kb for a small file was the problem.

The most important goal of this change is to make it more resilient to external edits (we want to enable changing descriptions or editing in VS Code without losing all metadata). Adding versioning should also allow us to evolve it going forward which would be a great win.

If we end up with something where a user could rename a variable in a text editor with find and replace, this would be fantastic. The original suggestion of referring to {method}.{variable}#offset or similar could easily allow this.

kazcw · 2024-10-16T15:45:35Z

If we end up with something where a user could rename a variable in a text editor with find and replace, this would be fantastic. The original suggestion of referring to {method}.{variable}#offset or similar could easily allow this.

The syncToCode algorithm handles this too. I designed it not just for the code editor but so that we could correctly handle any external edits that occur while the IDE is running. My proposal of saving source code "snapshots" would extend usage of the algorithm we're already using for this purpose to the case where changes happen when the IDE is closed.

jdunkerley · 2024-10-16T17:35:18Z

This feels like a much larger piece of work than the original suggestion.
Ideally this would be delivered in this or the next sprint.

@kazcw how long would it take to implement this kind of approach (bearing in mind it couldn't interrupt the work stream you already have)?

And presumably, other than us throwing it away later - the other approach wouldn't stop us doing it later.

kazcw · 2024-10-16T18:08:02Z

@jdunkerley

This feels like a much larger piece of work than the original suggestion. Ideally this would be delivered in this or the next sprint.

@kazcw how long would it take to implement this kind of approach (bearing in mind it couldn't interrupt the work stream you already have)?

And presumably, other than us throwing it away later - the other approach wouldn't stop us doing it later.

3 days, at most.

Add a test case ensuring that the `Ast.syncToCode` algorithm is able to maintain AST identities when a binding and its reference are renamed. This is an important case, as mentioned here: #11304 (comment)

kazcw · 2024-10-17T17:12:59Z

Here's my proposal. The idea is to add a layer to achieve edit-resilience, thus it is mostly orthogonal to other planned metadata improvements:

Metadata format:

Add a snapshot field to the IDE metadata object. The field contains a copy of the source code of the module. To prevent users from inadvertently editing the contents of this field, it would be "armored" by an encoding like compress+base64.
- This change is backward-compatible; if the field is absent, we would proceed as we do now (without resilience).
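
A minimal sketch of the "compress+base64" armoring, assuming a Node.js environment and deflate as the compression. The proposal does not fix a particular codec, and `armor` / `unarmor` are hypothetical names.

```typescript
import { deflateSync, inflateSync } from 'zlib'

// "Armor" the source: compress, then base64-encode, so that the snapshot is
// not casually editable as text. Deflate is an assumed choice of compression.
function armor(source: string): string {
  return deflateSync(Buffer.from(source, 'utf8')).toString('base64')
}

// Reverse the armoring to recover the exact source text of the snapshot.
function unarmor(armored: string): string {
  return inflateSync(Buffer.from(armored, 'base64')).toString('utf8')
}
```

The round trip is lossless, so the decoded snapshot can be compared byte for byte with the plain source to detect external edits.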

Writing a file (ydoc-server):

The file includes the same source code twice--once plainly (externally-editable), and once in the snapshot field (armored to prevent modification).

Reading a newly-opened file (ydoc-server):

Decode the source code from the snapshot metadata field, and parse it as snapshotAst.
Decode the node metadata, and attach it to snapshotAst [this step is unchanged from our current process; only the origin of the AST is different].
If the module's current code is not identical to the code in the snapshot, the file has been edited externally; in that case:
- Run snapshotAst.syncToCode(currentCode). This will update the AST to correspond to the current source code, while preserving object identities (and thus metadata associations) according to a heuristic comparing the old and new sources.
- Send a text edit to the language server updating the snapshot, and the metadata map. (This is not strictly necessary--if the IDE modifies anything, the file will be updated anyway; but if we update the snapshot even if the IDE doesn't change anything, that improves our ability to reconcile subsequent external changes.)
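
The reading steps can be sketched as follows. Only `syncToCode` is named in this discussion, and its signature here is an assumption; `SnapshotAst`, `Deps` and `openModule` are hypothetical stand-ins, with the parser, decoder and language-server call injected so the control flow is visible on its own.

```typescript
// Hypothetical stand-in for the real AST; the `syncToCode` signature is assumed.
interface SnapshotAst {
  syncToCode(code: string): void
}

// Hypothetical dependencies, injected to keep the sketch self-contained.
interface Deps {
  decodeSnapshot(armored: string): string
  parse(code: string): SnapshotAst
  attachMetadata(ast: SnapshotAst, metadata: unknown): void
  sendSnapshotUpdate(code: string): void
}

function openModule(
  currentCode: string,
  ide: { snapshot?: string; nodes: unknown },
  deps: Deps,
): SnapshotAst {
  // Backward compatibility: without a snapshot, proceed as before.
  if (ide.snapshot === undefined) {
    const ast = deps.parse(currentCode)
    deps.attachMetadata(ast, ide.nodes)
    return ast
  }
  const snapshotCode = deps.decodeSnapshot(ide.snapshot)
  const snapshotAst = deps.parse(snapshotCode) // step 1
  deps.attachMetadata(snapshotAst, ide.nodes) // step 2
  if (snapshotCode !== currentCode) {
    // Step 3: the file was edited externally.
    snapshotAst.syncToCode(currentCode)
    deps.sendSnapshotUpdate(currentCode)
  }
  return snapshotAst
}
```

The function returns only after any required update has been sent, so a caller can defer sharing the AST or requesting execution until the file has been fully processed.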

kazcw · 2024-10-18T15:02:37Z

Here's my proposal. The idea is to add a layer to achieve edit-resilience, thus it is mostly orthogonal to other planned metadata improvements:

Metadata format:

* Add a `snapshot` field to the IDE metadata object. The field contains a copy of _the source code_ of the module. To prevent users from inadvertently editing the contents of this field, it would be "armored" by an encoding like compress+base64.
  
  * This change is backward-compatible; if the field is absent, we would proceed as we do now (without resilience).

Writing a file (ydoc-server):

* The file includes the same source code twice--once plainly (externally-editable), and once in the `snapshot` field (armored to prevent modification).

Reading a newly-opened file (ydoc-server):

1. Decode the source code from the `snapshot` metadata field, and parse it as `snapshotAst`.

2. Decode the node metadata, and attach it to `snapshotAst` [this step is unchanged from our current process; only the origin of the AST is different].

3. If the module's current code is not identical to the code in the snapshot, the file has been edited externally; in that case:
   
   * Run `snapshotAst.syncToCode(currentCode)`. This will update the AST to correspond to the current source code, while preserving object identities (and thus metadata associations) according to a heuristic comparing the old and new sources.
   * Send a text edit to the language server updating the snapshot, and the metadata map. (This is not strictly necessary--if the IDE modifies anything, the file will be updated anyway; but if we update the snapshot even if the IDE doesn't change anything, that improves our ability to reconcile subsequent external changes.)

Note from today's meeting: The engine should not start execution until after the ydoc-server has updated the metadata map (which might cause UUID changes), or determined that no update is needed. I think the process I described above ensures this (but it should be kept in mind when implementing): The ydoc-server should process the file fully (and send any necessary edit to the backend) before sharing the AST with the GUI, requesting execution, or doing anything else with the source.

enso-bot · 2024-10-22T20:50:40Z

Dmitry Bushev reports a new STANDUP for yesterday (2024-10-21):

Progress: Updated the stored metadata with the new snapshot field. Testing in gui. It should be finished by 2024-10-28.