Question: Using marked lexer/parser in a highlighting pipeline? #2687
Replies: 17 comments
-
One related question: does marked use look-behinds in any of its regexes? (That would break on Safari.)
-
I don't believe we use any look-behinds.

That would be doable:

```js
marked.lex(`
# heading

[**bold** *text*](link.html)
`);
```

will return tokens like:

```js
[
  {type: "heading", raw: "# heading\n\n", depth: 1, text: "heading", tokens: [
    {type: "text", raw: "heading", text: "heading"}
  ]},
  {type: "paragraph", raw: "[**bold** *text*](link.html)", text: "[**bold** *text*](link.html)", tokens: [
    {type: "link", raw: "[**bold** *text*](link.html)", href: "link.html", title: null, text: "**bold** *text*", tokens: [
      {type: "strong", raw: "**bold**", text: "bold", tokens: [
        {type: "text", raw: "bold", text: "bold"}
      ]},
      {type: "text", raw: " ", text: " "},
      {type: "em", raw: "*text*", text: "text", tokens: [
        {type: "text", raw: "text", text: "text"}
      ]}
    ]}
  ]}
]
```

We could find a way to translate that into highlight.js tokens.
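For what it's worth, a token tree like that is easy to traverse. Here's a minimal hypothetical sketch (the token shape is assumed from the example above; `walkTokens` here is a helper of my own, not the marked `walkTokens` option):

```js
// Depth-first walk over marked-style lexer output. Each token may carry
// a nested `tokens` array; `visit` receives the token and its depth.
function walkTokens(tokens, visit, depth = 0) {
  for (const token of tokens) {
    visit(token, depth);
    if (token.tokens) walkTokens(token.tokens, visit, depth + 1);
  }
}
```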
-
I'm a bit confused by what I'm seeing... I suppose I could just look at your own rendering code to see how it's handling that. :)
-
Oh, actually we'd have to be more careful, since we'd have to look at `raw` too... rebuilding the markdown might result in some weird edge cases, so we'd really want the scopes AND the raw text... that might make it a bit harder, I think.
-
@UziTech Is there any option to make the parser/lexer spit out more context, such as the position (index) of tokens in the original source string?
-
No, the position is not saved, but it could be figured out from the `raw` of each token.

What would the ideal highlight.js tokens look like for that markdown?
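To illustrate, one hedged sketch of recovering positions after the fact: scan the source for each token's `raw`, advancing a cursor so siblings are found in order. The function name and approach are my own, not a marked API, and it skips tokens whose `raw` was normalized away from the source text:

```js
// Hypothetical helper: attach start/end offsets to marked-style tokens
// by searching the original source for each token's raw text.
// Children are searched from the parent's start; siblings from the
// previous sibling's end. Tokens whose raw no longer appears verbatim
// in the source (marked normalizes some input) are left unannotated.
function annotatePositions(tokens, src, from = 0) {
  let pos = from;
  for (const token of tokens) {
    const start = src.indexOf(token.raw, pos);
    if (start === -1) continue; // raw was rewritten during lexing
    token.start = start;
    token.end = start + token.raw.length;
    if (token.tokens) annotatePositions(token.tokens, src, start);
    pos = token.end;
  }
  return tokens;
}
```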
-
Are there any options to disable/configure that? (Or is it required for the lexer?)

(I added emphasis inside the header.)

I'm now thinking the simplest thing (if we had reliable start/end indexes, or could generate them easily by walking the tree and examining `raw`)...
-
Right now, though, I'm unsure which...

The same content, "I am using marked.", is repeated 3 times in the lexer output. I mean, I suppose we could try to write handlers for each type of token on our side... so that we analyze the `raw` ourselves.

Might be time to dig into the lexer source and poke around.
-
Once marked parses the markdown into tokens, it should be pretty easy to parse the `raw` of each token with regexps to change it into highlight.js tokens.
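For a heading token, for example, that per-token regexp pass might look something like this (a hypothetical helper, not part of either library):

```js
// Hypothetical: split a heading token's raw into the ATX marker, the
// heading text, and any trailing newlines, so each piece can be
// wrapped in its own highlight span.
function splitHeading(raw) {
  const m = raw.match(/^(#{1,6} +)(.*?)(\n*)$/);
  return m ? { marker: m[1], text: m[2], trailing: m[3] } : null;
}
```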
-
Oh yes, I'm not super worried about that part (I just wonder if there are any harder ones I'm not thinking of)... and for many tokens all we need is to wrap the text in a block (based on start/stop index)... some we really don't care about at all (paragraphs, etc.)...

That'd be awesome, though I wasn't necessarily asking for anyone to do the work; I was just trying to flesh out how feasible this approach is. The key thing is: if we have 43,838 bytes in, we need the same 43,838 bytes back out (just with HTML inserted for visual styling)... since we're just highlighting the raw code, we're not doing any rendering of the Markdown. So my original idea of us providing a custom Renderer... Of course you probably already realize that, since you jumped straight to the lexer output rather than talking about the Renderer.
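One cheap way to keep ourselves honest about that byte-for-byte guarantee is a round-trip check: strip the inserted spans back out and compare against the input. A hypothetical sketch (names are mine; any approach that inserts extra bytes such as `<br />` would rightly fail it):

```js
// Hypothetical round-trip check: highlighting should only wrap source
// bytes in spans, so removing the span tags must reproduce the input.
function stripHighlightSpans(html) {
  return html.replace(/<\/?span[^>]*>/g, '');
}

function roundTrips(source, highlighted) {
  return stripHighlightSpans(highlighted) === source;
}
```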
-
This is our internal "emitter" API that builds the token tree on our side: https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104

I had originally hoped we could just walk the lexer tree and then make calls into the emitter as we went. (Well, second hope, after realizing we can't just be a Renderer.)
-
Since we replace some things before tokenization, it wouldn't be possible to get a complete byte-for-byte transformation, but I think we can get close enough for the result to be usable. If we use a custom extension, we get access to the whole token in the renderer functions. The one thing I am thinking might be difficult is if we want to color things sometimes but not others (e.g. tokenize bold text except when it is inside link text). The renderer doesn't get information about parents.
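If the lexer output is walked directly instead of going through the renderer, carrying the ancestor chain is straightforward. A hypothetical sketch of the "no bold styling inside link text" rule (function names are mine; this walks the token tree ourselves rather than using marked's renderer):

```js
// Hypothetical policy: skip "strong" styling when any ancestor is a link.
function shouldStyle(token, ancestors) {
  if (token.type === 'strong' && ancestors.includes('link')) return false;
  return true;
}

// Walk marked-style tokens, passing each visitor the ancestor type chain.
function walk(tokens, visit, ancestors = []) {
  for (const t of tokens) {
    visit(t, ancestors);
    if (t.tokens) walk(t.tokens, visit, [...ancestors, t.type]);
  }
}
```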
-
Yeah, I'm thinking just using the lex output directly might be simpler, since it has all that information... for example, for a...

I'm looking at the lexer now...
-
The browser doesn't distinguish between tabs and spaces anyway, and since the goal of highlight.js is to produce HTML that will be rendered by a browser, I don't know if that would be a big deal.

Here is a POC just to show how our renderer could be used to output the HTML for highlight.js. I know the goal is to get it into tokens, but I don't think it would be too difficult to figure out how to change this to accomplish that.

It turned out to be easy to only render emphasis in the header and not in the link text. Just don't call `parseInline` in the `link` renderer.

```js
// marked-highlight.js
export const highlight = {
  extensions: [
    {
      name: 'heading',
      level: 'block',
      renderer(token) {
        const match = token.raw.match(/^(#+ +).+(\n+)$/);
        const newlines = '\n<br />'.repeat(match[2].length);
        const text = this.parser.parseInline(token.tokens);
        return `<span class="hljs-section">${match[1]}${text}</span>${newlines}`;
      }
    },
    {
      name: 'paragraph',
      level: 'block',
      renderer(token) {
        return this.parser.parseInline(token.tokens);
      }
    },
    {
      name: 'em',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-emphasis">${token.raw}</span>`;
      }
    },
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-strong">${token.raw}</span>`;
      }
    },
    {
      name: 'link',
      level: 'inline',
      renderer(token) {
        return `\n[<span class="hljs-string">${token.text}</span>](<span class="hljs-link">${token.href}</span>)`;
      }
    }
  ]
};
```

```js
import { marked } from 'marked';
import { highlight } from './marked-highlight.js';

marked.use(highlight);

console.log(marked(`
# heading *emphasis*

[**bold** *text*](link.html)
`));
```

output:

```html
<span class="hljs-section"># heading <span class="hljs-emphasis">*emphasis*</span></span>
<br />
<br />
[<span class="hljs-string">**bold** *text*</span>](<span class="hljs-link">link.html</span>)
```
-
Do you know if there is a way to get non-text out of the pipeline? I turned all the...
-
The marked parser is made to output HTML as a string, but you don't have to use that output. The renderer functions could instead populate some object and return an empty string.
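A hypothetical sketch of that pattern, following the extension shape from the POC earlier in the thread: the renderer pushes scope/raw pairs into an array and returns an empty string so marked's string pipeline stays happy. The `collected` array and `collector` name are my own:

```js
// Hypothetical: instead of emitting HTML, record each token's scope and
// raw text in `collected`, and return '' so marked still gets a string.
const collected = [];

const collector = {
  extensions: [
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        collected.push({ scope: 'strong', raw: token.raw });
        return ''; // nothing goes into marked's HTML output
      }
    }
  ]
};
// usage (sketch): marked.use(collector); marked(src); then read `collected`
```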
-
It's really a problem, and I have no idea how to handle it.

@UziTech Is there any way we can customize the pipeline and do step-by-step processing of the content? The best would be if we could even customize the order of the tasks in the pipeline, for example...

I don't think I need to open a new issue for what I said above; if you prefer me to do that, let me know.
-
I'm the current maintainer of Highlight.js. I'm posting this as a question because I'd like feedback on whether this is a good idea or not, or if there are any big gotchas I'm not thinking of... I'm not sure we want to increase the dependencies of the core library, but perhaps we could experiment with this idea via a `highlightjs-markdown` 3rd-party grammar, etc...

Describe the feature

I've been long considering the idea of allowing some grammars to take advantage of actual parsers for languages (rather than just a bunch of discrete regex rules)... for example, when Highlight.js goes to highlight a Markdown file, one might imagine the process looking a bit like this:

1. use `marked` to lex/parse the Markdown into tokens/blocks
2. a `marked` Renderer generates an internal Highlight.js `TokenTree` from the parsed `marked` tokens
3. `TokenTree#toHTML()` generates the HTML output

This would mean that instantly our highlighting of Markdown would gain all the fidelity and precision offered by the Marked parsing engine... much increased accuracy in exchange for a larger download size (`marked` is larger than our regex grammar rules).

Note: I'm not talking about Marked using `highlight.js` to help render... I'm talking about Highlight.js using `marked` to help highlight Markdown files...

Why is this feature necessary?

To improve highlighting of grammars that can't be fully expressed with simple regex rules alone.

The issue that led to my posting this: highlightjs/highlight.js#3519

But there have been many similar issues in the past... Markdown is truly hard/impossible to get right using our own internal grammar engine because it's not really super context-aware, and writing super-context-aware grammars gets messy very, very fast.

Describe alternatives you've considered

More and more gnarly regex...