Question: Using marked lexer/parser in a highlighting pipeline? #2687
Replies: 17 comments
-
One related question: does marked use look-behinds in any of its regexes? (That would break on Safari.)
-
I don't believe we use any look-behinds.

That would be doable:

```js
marked.lex(`
# heading

[**bold** *text*](link.html)
`);
```

will return tokens like:

```js
[
  {type: "heading", raw: "# heading\n\n", depth: 1, text: "heading", tokens: [
    {type: "text", raw: "heading", text: "heading"}
  ]},
  {type: "paragraph", raw: "[**bold** *text*](link.html)", text: "[**bold** *text*](link.html)", tokens: [
    {type: "link", raw: "[**bold** *text*](link.html)", href: "link.html", title: null, text: "**bold** *text*", tokens: [
      {type: "strong", raw: "**bold**", text: "bold", tokens: [
        {type: "text", raw: "bold", text: "bold"}
      ]},
      {type: "text", raw: " ", text: " "},
      {type: "em", raw: "*text*", text: "text", tokens: [
        {type: "text", raw: "text", text: "text"}
      ]}
    ]}
  ]}
]
```

We could find a way to translate that into highlight.js tokens.
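For what it's worth, a token tree like that is easy to traverse. Here's a minimal hypothetical sketch (the token shape is assumed from the example above; `walkTokens` here is a helper of my own, not the marked `walkTokens` option):

```js
// Depth-first walk over marked-style lexer output. Each token may carry
// a nested `tokens` array; `visit` receives the token and its depth.
function walkTokens(tokens, visit, depth = 0) {
  for (const token of tokens) {
    visit(token, depth);
    if (token.tokens) walkTokens(token.tokens, visit, depth + 1);
  }
}
```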
-
I'm a bit confused by what I'm seeing... I suppose I could just look at your own rendering code to see how it's handling that. :)
-
Oh, actually we'd have to be more careful, since we'd have to look at `raw` too... rebuilding the markdown might result in some weird edge cases, so we'd really want the scopes AND the raw text... that might make it a bit harder, I think.
-
@UziTech Is there any option to make the parser/lexer spit out more context, such as the position (index) of tokens in the original source string?
-
No, the position is not saved, but it could be figured out from the `raw` of each token.

What would the ideal highlight.js tokens look like for that markdown?
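To illustrate, one hedged sketch of recovering positions after the fact: scan the source for each token's `raw`, advancing a cursor so siblings are found in order. The function name and approach are my own, not a marked API, and it skips tokens whose `raw` was normalized away from the source text:

```js
// Hypothetical helper: attach start/end offsets to marked-style tokens
// by searching the original source for each token's raw text.
// Children are searched from the parent's start; siblings from the
// previous sibling's end. Tokens whose raw no longer appears verbatim
// in the source (marked normalizes some input) are left unannotated.
function annotatePositions(tokens, src, from = 0) {
  let pos = from;
  for (const token of tokens) {
    const start = src.indexOf(token.raw, pos);
    if (start === -1) continue; // raw was rewritten during lexing
    token.start = start;
    token.end = start + token.raw.length;
    if (token.tokens) annotatePositions(token.tokens, src, start);
    pos = token.end;
  }
  return tokens;
}
```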
-
Are there any options to disable/configure that? (Or is it required for the lexer?)

(I added emphasis inside the header.)

I'm now thinking the simplest thing (if we had reliable start/end indexes, or could generate them easily by walking the tree and examining `raw`)...
-
Right now, though, I'm unsure which...

The same content, "I am using marked.", is repeated 3 times in the lexer output. I mean, I suppose we could try to write handlers for each type of token on our side... so that we analyze the `raw` ourselves.

Might be time to dig into the lexer source and poke around.
-
Once marked parses the markdown into tokens, it should be pretty easy to parse the `raw` of each token with regexps to change it into highlight.js tokens.
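For a heading token, for example, that per-token regexp pass might look something like this (a hypothetical helper, not part of either library):

```js
// Hypothetical: split a heading token's raw into the ATX marker, the
// heading text, and any trailing newlines, so each piece can be
// wrapped in its own highlight span.
function splitHeading(raw) {
  const m = raw.match(/^(#{1,6} +)(.*?)(\n*)$/);
  return m ? { marker: m[1], text: m[2], trailing: m[3] } : null;
}
```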
-
Oh yes, I'm not super worried about that part (I just wonder if there are any harder ones I'm not thinking of)... and for many tokens all we need is to wrap the text in a block (based on start/stop index)... some we really don't care about at all (paragraphs, etc.)...

That'd be awesome, though I wasn't necessarily asking for anyone to do the work; I was just trying to flesh out how feasible this approach is. The key thing is: if we have 43,838 bytes in, we need the same 43,838 bytes back out (just with HTML inserted for visual styling)... since we're just highlighting the raw code, we're not doing any rendering of the Markdown. So my original idea of us providing a custom Renderer... Of course you probably already realize that, since you jumped straight to the lexer output rather than talking about the Renderer.
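One cheap way to keep ourselves honest about that byte-for-byte guarantee is a round-trip check: strip the inserted spans back out and compare against the input. A hypothetical sketch (names are mine; any approach that inserts extra bytes such as `<br />` would rightly fail it):

```js
// Hypothetical round-trip check: highlighting should only wrap source
// bytes in spans, so removing the span tags must reproduce the input.
function stripHighlightSpans(html) {
  return html.replace(/<\/?span[^>]*>/g, '');
}

function roundTrips(source, highlighted) {
  return stripHighlightSpans(highlighted) === source;
}
```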
-
This is our internal "emitter" API that builds the token tree on our side: https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104

I had originally hoped we could just walk the lexer tree and then make calls into the emitter as we went. (Well, second hope, after realizing we can't just be a Renderer.)
-
Since we replace some things before tokenization, it wouldn't be possible to get a complete byte-for-byte transformation, but I think we can get close enough for the result to be usable. If we use a custom extension, we get access to the whole token in the renderer functions. The one thing I am thinking might be difficult is if we want to color things sometimes but not others (e.g. tokenize bold text except when it is inside link text). The renderer doesn't get information about parents.
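If the lexer output is walked directly instead of going through the renderer, carrying the ancestor chain is straightforward. A hypothetical sketch of the "no bold styling inside link text" rule (function names are mine; this walks the token tree ourselves rather than using marked's renderer):

```js
// Hypothetical policy: skip "strong" styling when any ancestor is a link.
function shouldStyle(token, ancestors) {
  if (token.type === 'strong' && ancestors.includes('link')) return false;
  return true;
}

// Walk marked-style tokens, passing each visitor the ancestor type chain.
function walk(tokens, visit, ancestors = []) {
  for (const t of tokens) {
    visit(t, ancestors);
    if (t.tokens) walk(t.tokens, visit, [...ancestors, t.type]);
  }
}
```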
-
Yeah, I'm thinking just using the lex output directly might be simpler, since it has all that information... for example, for a...

I'm looking at the lexer now...
-
The browser doesn't distinguish between tabs and spaces anyway, and since the goal of highlight.js is to produce HTML that will be rendered by a browser, I don't know if that would be a big deal.

Here is a POC just to show how our renderer could be used to output the HTML for highlight.js. I know the goal is to get it into tokens, but I don't think it would be too difficult to figure out how to change this to accomplish that.

It turned out to be easy to only render emphasis in the header and not in the link text. Just don't call `parseInline` in the `link` renderer.

```js
// marked-highlight.js
export const highlight = {
  extensions: [
    {
      name: 'heading',
      level: 'block',
      renderer(token) {
        const match = token.raw.match(/^(#+ +).+(\n+)$/);
        const newlines = '\n<br />'.repeat(match[2].length);
        const text = this.parser.parseInline(token.tokens);
        return `<span class="hljs-section">${match[1]}${text}</span>${newlines}`;
      }
    },
    {
      name: 'paragraph',
      level: 'block',
      renderer(token) {
        return this.parser.parseInline(token.tokens);
      }
    },
    {
      name: 'em',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-emphasis">${token.raw}</span>`;
      }
    },
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-strong">${token.raw}</span>`;
      }
    },
    {
      name: 'link',
      level: 'inline',
      renderer(token) {
        return `\n[<span class="hljs-string">${token.text}</span>](<span class="hljs-link">${token.href}</span>)`;
      }
    }
  ]
};
```

```js
import { marked } from 'marked';
import { highlight } from './marked-highlight.js';

marked.use(highlight);

console.log(marked(`
# heading *emphasis*

[**bold** *text*](link.html)
`));
```

output:

```html
<span class="hljs-section"># heading <span class="hljs-emphasis">*emphasis*</span></span>
<br />
<br />
[<span class="hljs-string">**bold** *text*</span>](<span class="hljs-link">link.html</span>)
```
-
Do you know if there is a way to get non-text out of the pipeline? I turned all the...
-
The marked parser is made to output HTML as a string, but you don't have to use that output. The renderer functions could instead populate some object and return an empty string.
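A hypothetical sketch of that pattern, following the extension shape from the POC earlier in the thread: the renderer pushes scope/raw pairs into an array and returns an empty string so marked's string pipeline stays happy. The `collected` array and `collector` name are my own:

```js
// Hypothetical: instead of emitting HTML, record each token's scope and
// raw text in `collected`, and return '' so marked still gets a string.
const collected = [];

const collector = {
  extensions: [
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        collected.push({ scope: 'strong', raw: token.raw });
        return ''; // nothing goes into marked's HTML output
      }
    }
  ]
};
// usage (sketch): marked.use(collector); marked(src); then read `collected`
```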
-
It's really a problem, and I have no idea how to handle it.

@UziTech Is there any way we can customize the pipeline and do step-by-step processing of the content? The best would be if we could even customize the order of the tasks in the pipeline, for example...

I don't think I need to open a new issue for what I said above; if you prefer me to do that, let me know.
-
I'm the current maintainer of Highlight.js. I'm posting this as a question because I'd like feedback on whether this is a good idea or not, or if there are any big gotchas I'm not thinking of... I'm not sure we want to increase the dependencies of the core library, but perhaps we could experiment with this idea via a `highlightjs-markdown` 3rd-party grammar, etc...

Describe the feature

I've been long considering the idea of allowing some grammars to take advantage of actual parsers for languages (rather than just a bunch of discrete regex rules)... for example, when Highlight.js goes to highlight a Markdown file, one might imagine the process looking a bit like this:

1. use `marked` to lex/parse the Markdown into tokens/blocks
2. a `marked` Renderer generates an internal Highlight.js `TokenTree` from the parsed `marked` tokens
3. `TokenTree#toHTML()` generates the HTML output

This would mean that instantly our highlighting of Markdown would gain all the fidelity and precision offered by the Marked parsing engine... much increased accuracy in exchange for a larger download size (`marked` is larger than our regex grammar rules).

Note: I'm not talking about Marked using `highlight.js` to help render... I'm talking about Highlight.js using `marked` to help highlight Markdown files...

Why is this feature necessary?

To improve highlighting of grammars that can't be fully expressed with simple regex rules alone.

The issue that led to my posting this: highlightjs/highlight.js#3519

But there have been many similar issues in the past... Markdown is truly hard/impossible to get right using our own internal grammar engine because it's not really super context-aware, and writing super-context-aware grammars gets messy very, very fast.

Describe alternatives you've considered

More and more gnarly regex...