Support arbitrary embedded languages on an inner pass without breaking container syntax #243

Open
AlfishSoftware opened this issue Oct 8, 2024 · 0 comments

AlfishSoftware commented Oct 8, 2024

There should be a way to embed languages generically, without having to account for every comment, string, or other construct of the embedded language that happens to break the unrelated container syntax, and without having to work around that by adding many unrelated "hack" rules.

Several languages allow arbitrary embedded languages (Markdown is one example), and accounting for every language pair combination does not scale when this could very well be solved generically. The syntax of an embedded language should be determined on a second "pass", without breaking the syntax of whatever delimits it in the parent language, while still allowing the parent's escape sequences (e.g. in strings) to take precedence over the embedded language.

I think this would be the ideal implementation for the best embedded language support.
Allow a subPatterns field (and an optional replacementPatterns field with it) that uses this second-pass logic. They would be mutually exclusive with patterns. This is how it could work when subPatterns is present:

  • The start..end|while rule is matched first, without considering any sub-patterns or replacement patterns. Say the text content between the delimiters is stored in an innerText variable.
  • Then the replacement patterns are applied, if they exist. They are basically the same as patterns, except that they use match and a replaceWith field to specify a substitution within innerText. The result is placed in a subCode variable, and the sub-patterns later operate on this substituted text. For example, if a &lt; to < substitution occurs, the sub-patterns see the < character. This lets you resolve escaping syntax from the parent language before the sub-pattern that includes the embedded language runs.
    • The replaceWith field can have back-references to its match groups. Each can be the literal group text, or the Unicode character for the hex or decimal number in the group (for generic Unicode escape sequences).
  • Then apply subPatterns to just the subCode text atomically, on an inner/sub pass.
  • For any regions of innerText that had replacements, apply the replacement scope name on top of whatever scopes come from the sub-patterns. This way you can intermix the escaping syntax of both languages.
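To make the steps above concrete, here is a minimal Python sketch of the replacement pass and the scope layering. Everything in it is an assumption for illustration: the names RULES, apply_replacements, and layer_scopes are hypothetical, Python's re module stands in for the Oniguruma engine (so [0-9A-Fa-f] replaces \h), and each replacement is assumed to expand to exactly one character. It is not how any existing TextMate engine works.

```python
import re

# Hypothetical rules mirroring the proposed match / replaceWith / name fields.
# "lit" ~ $1, "hex" ~ $h1, "dec" ~ $d1 in the proposal's notation.
ESCAPE_SCOPE = "constant.character.escape.my-lang"
RULES = [
    (re.compile(r"\\([`\\])"), "lit", ESCAPE_SCOPE),
    (re.compile(r"\\u([0-9A-Fa-f]{4})"), "hex", ESCAPE_SCOPE),
    (re.compile(r"\\c\[(\d+)\]"), "dec", ESCAPE_SCOPE),
]


def expand(kind, group):
    """Turn a captured group into its replacement character."""
    if kind == "lit":
        return group
    base = 16 if kind == "hex" else 10
    return chr(int(group, base))


def apply_replacements(inner_text):
    """Step 2: build subCode from innerText.

    Returns (sub_code, origin, spans) where origin[i] is the innerText index
    that character i of sub_code came from, and spans lists the
    (start, end, scope) ranges of innerText that were replaced.
    """
    sub_code, origin, spans = [], [], []
    pos = 0
    while pos < len(inner_text):
        for pattern, kind, scope in RULES:
            m = pattern.match(inner_text, pos)
            if m:
                sub_code.append(expand(kind, m.group(1)))
                origin.append(m.start())
                spans.append((m.start(), m.end(), scope))
                pos = m.end()
                break
        else:
            # No rule matched here: copy the character through unchanged.
            sub_code.append(inner_text[pos])
            origin.append(pos)
            pos += 1
    return "".join(sub_code), origin, spans


def layer_scopes(inner_index, embedded_scopes, spans):
    """Step 4: put the escape scope on top of the embedded-language scopes."""
    extra = [scope for start, end, scope in spans if start <= inner_index < end]
    return list(embedded_scopes) + extra
```

The origin list is what would let tokens produced on subCode (step 3) be mapped back to positions in the original document, since the two strings differ in length wherever an escape was replaced.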

Additionally, allow parent back-references in the "include" names, so you can reference arbitrary language IDs.
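As a rough illustration of include back-references, the sketch below substitutes $N in an include name with capture group N from the parent begin match. The function name resolve_include is hypothetical, and leaving an unknown group reference untouched is my assumption. The proposal does not say what should happen in that case, or when no grammar is registered for the resulting scope name.

```python
import re


def resolve_include(include, captures):
    """Replace each $N in an include name with capture group N of the parent match.

    `captures` is indexed like regex groups: captures[0] is the whole match.
    References to missing or unmatched groups are left as-is (an assumption).
    """
    def substitute(m):
        index = int(m.group(1))
        if index < len(captures) and captures[index] is not None:
            return captures[index]
        return m.group(0)

    return re.sub(r"\$(\d+)", substitute, include)


# Usage: the parent rule's begin pattern captures the language id in group 1.
begin = re.compile(r"([\w-]+)`")
match = begin.match("json`{}`")
captures = [match.group(0), *match.groups()]
print(resolve_include("source.$1", captures))  # source.json
```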

A theoretical example:

{
  "name": "string.quoted.embedded-code.$1.my-lang",
  "begin": "([\\w-]+)`", // group 1 is the language id
  "beginCaptures": {
    "1": { "name": "entity.other.language.my-lang" }
  },
  "end": "`",
  "contentName": "meta.embedded.block.$1 source.$1",
  "replacementPatterns": [
    // $1 would replace with the char in group 1 below literally
    { "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
    // $h1 could replace with the unicode char from the hex number matched by group 1
    { "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
    // $d1 same as above, but for decimal numbers
    { "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" }
  ],
  "subPatterns": [
    // "include" could allow back-references from the parent begin/match pattern
    // to support arbitrary languages
    { "include": "source.$1" }
  ]
}

This would let you include any arbitrary embedded language without having to know anything about its syntax, and you could even have escaping in the parent language be recognized and everything would just work.

Example code for this theoretical my-lang:
(all escapes are from my-lang, except backslash is escaped twice, for both languages)

json`
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
`
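As a worked check of this example, the sketch below takes the inner text as already extracted from between the backticks, applies the three my-lang escapes with a single combined regex, and parses the result as JSON. The names ESCAPE and unescape are hypothetical, and Python's re module is used in place of Oniguruma, so this only approximates what a real engine would do.

```python
import json
import re

# One alternation per hypothetical replacement pattern:
# group 1: literal backtick or backslash, group 2: 4 hex digits, group 3: decimal digits.
ESCAPE = re.compile(r"\\(?:([`\\])|u([0-9A-Fa-f]{4})|c\[(\d+)\])")


def unescape(m):
    """Return the character a my-lang escape sequence stands for."""
    literal, hex_digits, dec_digits = m.groups()
    if literal is not None:
        return literal
    if hex_digits is not None:
        return chr(int(hex_digits, 16))
    return chr(int(dec_digits))


# innerText: the raw text between the backticks in the my-lang source.
inner_text = r'''
{
  "backtick": "\`",
  "backslash": "\\\\",
  "slash": "\u002F",
  "percent": "\c[37]"
}
'''

# subCode: what the embedded JSON grammar would see on the inner pass.
sub_code = ESCAPE.sub(unescape, inner_text)
parsed = json.loads(sub_code)
print(parsed)
```

After the my-lang pass, the four backslashes collapse to two, which JSON then reads as a single escaped backslash. That is the "escaped twice" case mentioned above.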

These would not break the syntax in my-lang, as the inner code is isolated:

json`"`
js`//`
cpp`/*`
python`#`
csharp`(`
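The isolation can be illustrated with a small sketch that finds the container delimiters first and only then hands the inner text to anything else. The regex and the split_container helper are hypothetical simplifications: they take the end rule literally and ignore escaped backticks, which none of the lines above contain.

```python
import re

# Container rule only: language id, opening backtick, inner text, closing backtick.
CONTAINER = re.compile(r"([\w-]+)`([^`]*)`")


def split_container(line):
    """Return (language_id, inner_text) without looking at the inner syntax."""
    m = CONTAINER.fullmatch(line)
    if m is None:
        raise ValueError(f"not an embedded-code literal: {line!r}")
    return m.group(1), m.group(2)


for sample in ['json`"`', "js`//`", "cpp`/*`", "python`#`", "csharp`(`"]:
    language_id, inner_text = split_container(sample)
    print(language_id, repr(inner_text))
```

Because the closing backtick is located before any embedded grammar runs, an unterminated string, comment, or bracket in the inner text cannot extend past the container.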
RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Oct 11, 2024