Support arbitrary embedded languages on an inner pass without breaking container syntax #243

AlfishSoftware · 2024-10-08T22:22:16Z

There should be a way to embed languages generically, without having to account for every possible comment, string, or other construct of the embedded language that just happens to break the unrelated container syntax, and without having to work around that by adding many unrelated "hack" rules.

Several languages allow specifying arbitrary embedded languages (Markdown is an example), and having to account for every language-pair combination is impractical when this could be solved generically. The syntax of an embedded language should be determined on a second "pass", without breaking the syntax of whatever delimits it in the parent language, while still allowing the parent's escape sequences (e.g. in strings) to override the embedded language.

I think this would be the ideal implementation for the best embedded language support.
Allow a subPatterns field (and an optional replacementPatterns field with it) that uses this second-pass logic. They would be mutually exclusive with patterns. This is how it could work when subPatterns is present:

1. The begin..end|while rule is matched first, without considering any sub-patterns or replacement patterns. Say the text content between the delimiters is stored in an innerText variable.
2. Then the replacement patterns are applied, if they exist. They are basically the same as patterns, except that they use match and a replaceWith field to specify a substitution within innerText. The result is placed in a subCode variable, and the sub-patterns later operate on it. For example, if a &lt; to < substitution occurs, the sub-patterns operate on the new text. This lets you replace the parent language's escaping syntax before the sub-pattern that includes the embedded language runs.
   - The replaceWith field can have back-references to its match groups. Those can be the literal group text, or the Unicode character given by the hex or decimal number in the group (for generic Unicode escape sequences).
3. Then subPatterns are applied to just the subCode text, atomically, on an inner/sub pass.
4. For any regions of innerText that had replacements, the replacement scope name is applied on top of whatever scopes come from the sub-patterns. This way you can intermix the escaping syntax of both languages.
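
The replacement step above can be sketched outside any real grammar engine. The following Python is only an illustration under assumptions: the names Replacement, ReplacedSpan and apply_replacements are hypothetical, and "replaceWith" is modelled as a plain callable rather than a template string. It builds subCode from innerText and records where each replaced region came from, which is the information a tokenizer would need to layer the replacement scope on top of the sub-pattern scopes afterwards.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Replacement:
    """Hypothetical stand-in for one entry of "replacementPatterns"."""
    pattern: re.Pattern                  # plays the role of "match"
    replace: Callable[[re.Match], str]   # plays the role of "replaceWith"
    scope: str                           # plays the role of "name"


@dataclass
class ReplacedSpan:
    inner_start: int  # replaced region in innerText (start, inclusive)
    inner_end: int    # replaced region in innerText (end, exclusive)
    sub_start: int    # where the substituted text landed in subCode
    sub_end: int
    scope: str


def apply_replacements(inner_text: str,
                       rules: List[Replacement]) -> Tuple[str, List[ReplacedSpan]]:
    """Build subCode from innerText and remember which regions were replaced."""
    out: List[str] = []
    spans: List[ReplacedSpan] = []
    pos = 0      # cursor in inner_text
    sub_pos = 0  # cursor in the sub_code being built
    while pos < len(inner_text):
        for rule in rules:
            m = rule.pattern.match(inner_text, pos)
            if m and m.end() > pos:  # ignore empty matches so the loop advances
                text = rule.replace(m)
                spans.append(ReplacedSpan(pos, m.end(),
                                          sub_pos, sub_pos + len(text),
                                          rule.scope))
                out.append(text)
                sub_pos += len(text)
                pos = m.end()
                break
        else:
            # No rule matched here: copy one character through unchanged.
            out.append(inner_text[pos])
            sub_pos += 1
            pos += 1
    return "".join(out), spans


# Usage: unescape \` and \\ the way the parent language would.
rules = [Replacement(re.compile(r"\\([`\\])"),
                     lambda m: m.group(1),
                     "constant.character.escape.my-lang")]
sub_code, spans = apply_replacements(r'"a\`b"', rules)
print(sub_code)
print(spans)
```

Steps 3 and 4 are not implemented here; the sketch only shows that the sub pass can work on subCode while the recorded spans map each substitution back to its original position in innerText.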

Additionally, allow parent back-references in the "include" names, so you can add any arbitrary language ids.
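
As a rough illustration of that idea, the back-reference in an include name could be resolved from the parent's begin match as below. This is a sketch in Python, not existing engine behavior: resolve_include is a hypothetical helper, and it assumes references use the same $N form as scope names.

```python
import re


def resolve_include(include: str, begin_match: re.Match) -> str:
    """Replace each $N in an include name with capture group N of the begin match."""
    def group_text(ref: re.Match) -> str:
        value = begin_match.group(int(ref.group(1)))
        # A group that did not participate in the match expands to nothing.
        return value if value is not None else ""
    return re.sub(r"\$(\d+)", group_text, include)


# Group 1 of the begin pattern is the language id.
begin = re.compile(r"([\w-]+)`")
m = begin.match("json`{}`")
print(resolve_include("source.$1", m))
```

An engine would then look up the resolved scope name among the installed grammars and presumably fall back to plain text when none is registered.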

A theoretical example:

{
  "name": "string.quoted.embedded-code.$1.my-lang",
  "begin": "([\\w-]+)`", // group 1 is the language id
  "beginCaptures": {
    "1": { "name": "entity.other.language.my-lang" }
  },
  "end": "`",
  "contentName": "meta.embedded.block.$1 source.$1",
  "replacementPatterns": [
    // $1 would replace with the char in group 1 below literally
    { "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
    // $h1 could replace with the unicode char from the hex number matched by group 1
    { "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
    // $d1 same as above, but for decimal numbers
    { "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" }
  ],
  "subPatterns": [
    // "include" could allow back-references from the parent begin/match pattern
    // to support arbitrary languages
    { "include": "source.$1" }
  ]
}

This would let you include any arbitrary embedded language without having to know anything about its syntax, and you could even have escaping in the parent language be recognized and everything would just work.
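
The $1, $h1 and $d1 forms used in the replaceWith comments of the example could be expanded along these lines. This is a minimal Python sketch: expand_replace_with is a hypothetical helper, and because Python's re module has no \h class, the hex pattern below is spelled [0-9A-Fa-f] instead.

```python
import re

# $N = literal group text, $hN = group read as hex, $dN = group read as decimal.
_REF = re.compile(r"\$(h|d)?(\d+)")


def expand_replace_with(template: str, m: re.Match) -> str:
    """Expand a replaceWith template against one match of a replacement pattern."""
    def expand(ref: re.Match) -> str:
        kind, index = ref.group(1), int(ref.group(2))
        text = m.group(index)
        if kind == "h":
            return chr(int(text, 16))  # code point written in hexadecimal
        if kind == "d":
            return chr(int(text, 10))  # code point written in decimal
        return text                    # plain $N: the group text itself
    return _REF.sub(expand, template)


literal = expand_replace_with("$1", re.match(r"\\([`\\])", "\\`"))
from_hex = expand_replace_with("$h1", re.match(r"\\u([0-9A-Fa-f]{4})", r"\u002F"))
from_dec = expand_replace_with("$d1", re.match(r"\\c\[(\d+)\]", r"\c[37]"))
print(literal, from_hex, from_dec)
```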

Example code for this theoretical my-lang:
(All escapes are my-lang escapes; the backslash is escaped twice, once for each language.)

json`
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
`

These would not break the syntax in my-lang, as the inner code is isolated:

json`"`
js`//`
cpp`/*`
python`#`
csharp`(`
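
The isolation comes from locating the delimiters before anything inside them is tokenized. A simplified Python sketch of that first pass follows; split_embedded is a hypothetical helper, and it assumes the closing delimiter is simply the next backtick, so it does not handle escaped backticks inside the region.

```python
import re
from typing import Optional, Tuple

BEGIN = re.compile(r"([\w-]+)`")  # group 1 is the language id
END = re.compile(r"`")


def split_embedded(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (language_id, inner_text, rest) for the first embedded region, if any.

    The inner text is never inspected, so an unterminated string or comment
    inside it cannot swallow the closing delimiter.
    """
    begin = BEGIN.search(line)
    if begin is None:
        return None
    end = END.search(line, begin.end())
    if end is None:
        return None
    return begin.group(1), line[begin.end():end.start()], line[end.end():]


for sample in ['json`"`', 'js`//`', 'cpp`/*`', 'python`#`', 'csharp`(`']:
    print(split_embedded(sample + " + rest"))
```

In each case the text after the closing backtick stays outside the embedded region, regardless of what the inner language would make of its content.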

AlfishSoftware mentioned this issue Oct 8, 2024

Option to match end before any patterns #139

Open

RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Oct 11, 2024

Improve error handling

07ccecf
