feat: Custom Tokenizer/Renderer extensions #2043
Conversation
We should definitely only run through the extensions once and create a map like:

```js
this.extensions = {
  paragraph: [
    underline
  ]
};
```

Then look them up like:

```js
runTokenizerExtension(src, tokens, before) {
  let tokensLength = 0;
  if (this.extensions[before]) {
    this.extensions[before].forEach((extension) => {
      const token = extension.tokenizer(src);
      if (token) {
        src = src.substring(token.raw.length);
        tokens.push(token);
        tokensLength += token.raw.length;
      }
    });
  }
  return tokensLength;
}
```
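The map-plus-lookup idea above can be sketched end to end. This is an illustrative, self-contained version, not marked's actual internals: `registerExtensions`, the `before` property, and the toy `:text:` underline syntax are all assumptions for the example.

```js
// Build the lookup map once at registration time, so the Lexer never has
// to scan the full extension list for every token. Hypothetical helper,
// not marked's API.
function registerExtensions(extensionList) {
  const map = {};
  for (const ext of extensionList) {
    const before = ext.before || 'paragraph'; // assumed default insertion point
    if (!map[before]) map[before] = [];
    map[before].push(ext);
  }
  return map;
}

// Toy "underline" tokenizer: matches text wrapped in colons, e.g. ":hi:"
const underline = {
  before: 'paragraph',
  tokenizer(src) {
    const match = /^:([^:\n]+):/.exec(src);
    if (match) {
      return { type: 'underline', raw: match[0], text: match[1] };
    }
  }
};

const extensions = registerExtensions([underline]);

// Lookup loop mirroring the runTokenizerExtension idea above
function runTokenizerExtension(src, tokens, before) {
  let tokensLength = 0;
  if (extensions[before]) {
    for (const extension of extensions[before]) {
      const token = extension.tokenizer(src);
      if (token) {
        src = src.substring(token.raw.length);
        tokens.push(token);
        tokensLength += token.raw.length;
      }
    }
  }
  return tokensLength;
}
```

With this shape, registration cost is paid once and each tokenizer call is a single map lookup plus a scan of only the extensions registered for that position.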
Used a similar method to yours. Although... this would make customizing the paragraph tokenizer much more difficult. Maybe this logic can go into the Lexer so the paragraph tokenizer remains easily editable.
@UziTech Is there a way to clear the extensions/options between unit tests?
Edit: Let's see... I've found
Edit2: Bah. For some reason that causes tests in Parser-spec.js to fail in very strange ways even though I'm not touching that file at all. Text from marked-spec.js is showing up in the results there for some reason? I wonder if the async tests are to blame. Commented out for now...
Added an "inline" version of my underline token and the extension works. I was looking at the header-ids extension but I'm not sure how best to approach that. I see a few options:
So this brings up some thoughts:
You should add
Good question. I'm not sure if it is better to have multiple ways to do the same thing. I think it would depend on whether it would slow down marked without extensions too much.
I was imagining just a change in marked.js to handle a different extension format, but we would parse it into the same marked.defaults.renderer or whatever, so speed should be similar when executing. I would probably be in favor of having just one way to do things, and I also like the idea of consistency in how extensions are formatted.
Sounds good. We could add this new style of extensions, deprecate the old way, and remove it in v3 or v4.
We could pass the parent token or something if it is not a top-level token, so the tokenizer or renderer could do different things depending on what type of token it is in.
Ok, so if we do this and require each extension to be formatted as an object, what about supporting passing in either a single extension object or an array of extensions (rather than a nested object of extensions), so order is preserved for consistency? Like so:
If an extension has "overwrite", treat it as we currently treat extensions, i.e., merge into
Otherwise, merge into an
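The "single object or array" intake described above can be sketched as a small normalization step. This is a hypothetical illustration (`normalizeExtensions` and the returned shape are invented names, not marked's internals):

```js
// Accept either a single extension object or an array, preserving
// registration order, and split "overwrite" extensions (merged into the
// existing tokenizer/renderer) from new custom extensions.
function normalizeExtensions(extsArg) {
  const list = Array.isArray(extsArg) ? extsArg : [extsArg];
  const overwrites = []; // replace built-in tokenizer/renderer behavior
  const custom = [];     // brand-new token types, kept in order
  for (const ext of list) {
    if (ext.overwrite) {
      overwrites.push(ext);
    } else {
      custom.push(ext);
    }
  }
  return { overwrites, custom };
}
```

Because the input is a flat list rather than a nested object, the order in which extensions were passed in is exactly the order in which they are tried.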
@UziTech I applied all the changes discussed above. Processing in marked.js got a bit ugly (and I'm probably missing special cases/error handling, which I am not experienced with) but the code elsewhere is much simplified now without the looping, and results in an
- Overwriting extensions now require the existence of the property
- Custom extensions with tokenizers will be added to the
- Custom extensions with renderers must have a
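The renderer-registration rules above (renderers must have a name; a later registration can fall back to an earlier one) can be sketched like this. Names are illustrative and the fallback convention (returning `false`) is an assumption for the example, echoing the `prevRenderer` idea that appears later in the diff:

```js
// Registry of renderer extensions keyed by token name.
const extensionRenderers = {};

// Register a renderer extension; if one with the same name already exists,
// chain them: when the new renderer returns false, the previous one runs.
function addRendererExtension(ext) {
  if (!ext.name) {
    throw new Error('extension renderers must have a "name"');
  }
  const prevRenderer = extensionRenderers[ext.name];
  extensionRenderers[ext.name] = function (token) {
    let ret = ext.renderer(token);
    if (ret === false && prevRenderer) {
      ret = prevRenderer(token); // fall back to the earlier renderer
    }
    return ret;
  };
}
```

Chaining via a captured `prevRenderer` means every renderer for a given token name stays reachable without the registry ever holding more than one function per name.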
Added current benchmarks to the OP. Aw man... Not sure where the slowdown is coming from yet.
Edit: Ah... Most of it is because I was accidentally comparing to my (old, but faster) fork of the Master branch. I updated the original post comparing the four scenarios.
Now that I think about it... do we NEED a
I have this working on my local branch and it actually removes most of the slowdown. If that sounds reasonable I can commit it.
Ya, that sounds fine. If we need to allow running between other tokenizers we can add that later.
I have thought about adding all of those things. The problem is: where do you stop? It could also be helpful to know what the parent token is... and the grandparent token... and all of the siblings. I think the best thing to do is say
How would a user go about doing that? Just accessing the Lexer?
Right... Ok, what if we do this instead: after a quick google, apparently we could get the index of the current token being walked with
yes
How is that easier than just accessing tokens from the lexer and walking them manually? If you want extensions to be able to get tokens, we should do something like hooks and just have a hook that gives the extension all tokens between lex and parse and lets them do whatever they want. Same with a hook for pre- and post-processing. For the gfm-heading-id extension it would be nice to be able to reset the slugger in a preprocess hook instead of needing to call it manually.
Because then the extension creator essentially has to re-write his own custom
The preprocessor/postprocessor hooks would make it work with
Ok, so another approach: what if inside of
So if you want to change a token between two paragraphs:
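The "change a token between two paragraphs" scenario can be sketched as a plain walk over the lexed token array before parsing. This is a simplified, self-contained illustration with invented token shapes, not marked's internal walker:

```js
// Walk a flat token array and replace any token that sits directly
// between two paragraph tokens. Knowing the index i makes in-place
// replacement trivial, which is the point raised in the discussion.
function replaceBetweenParagraphs(tokens, makeReplacement) {
  for (let i = 1; i < tokens.length - 1; i++) {
    if (tokens[i - 1].type === 'paragraph' && tokens[i + 1].type === 'paragraph') {
      tokens[i] = makeReplacement(tokens[i]);
    }
  }
  return tokens;
}
```

As the following comment notes, a real implementation also has to handle nested tokens (children arrays), where "between two paragraphs" is no longer a simple index check.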
Essentially the only limitation with
Except down isn't always in
Fair enough. I'll work on that last change.
nice work. 👍 This is going to allow so many new things for marked.
@styfle, @joshbruce, @davisjam — have you guys had a chance to look at this PR? I know it's a hefty one, so we could sure use some extra eyes on it!
I'd be lying if I said I understood it all. I appreciate that performance isn't hit too hard given this is something the community has been wanting directly (or by proxy) for a while now. It seems like at least a couple of members from the community appreciate it as well. So, I'm good. One thing I'd like to put a pin in is the use of
# [2.1.0](v2.0.7...v2.1.0) (2021-06-15)
### Features
* Custom Tokenizer/Renderer extensions ([#2043](#2043)) ([5be9d6d](5be9d6d))

🎉 This PR is included in version 2.1.0 🎉 The release is available on:
Your semantic-release bot 📦🚀
```js
  return ret;
};
if (ext.renderer) { // Renderer extensions
  const prevRenderer = extensions.renderers?.[ext.name];
```
@UziTech @calculuschild I think this line is causing the error below. Will create a separate issue too.

```
node_modules/protobufjs/node_modules/marked/src/marked.js:158
const prevRenderer = extensions.renderers?.[ext.name];
                                          ^
SyntaxError: Unexpected token '.'
```
Issue here: #2107
Description
New attempt at custom Tokens, Renderers for Marked.js. Faster than #1872.
May be a possible fix for: #1373, #1693, #1695, #2061,
Users can add a custom Tokenizer and Renderer which will be executed without needing to overwrite existing components, with the extension in the following format:
The extension(s) can then be loaded like so:
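The original code samples were stripped from this transcript. Based on the extension format this PR introduces (a named object with `level`, `start`, `tokenizer`, and `renderer`), a minimal sketch might look like the following; the `:text:` underline syntax itself is invented for illustration:

```js
// Hypothetical "underline" extension in the object format added by this PR.
const underlineExtension = {
  name: 'underline',
  level: 'block',            // 'block' or 'inline'
  start(src) {               // hint: index where this token might start
    return src.match(/:/)?.index;
  },
  tokenizer(src) {
    const match = /^:([^:\n]+):/.exec(src); // must match at the start of src
    if (match) {
      return { type: 'underline', raw: match[0], text: match[1] };
    }
  },
  renderer(token) {
    return `<u>${token.text}</u>\n`;
  }
};
```

It would then be registered with something like `marked.use({ extensions: [underlineExtension] })`, with either a single object or an array accepted per the discussion above.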
Benchmarks (on my laptop):
- Master from Feb 2021
- Master (current)
- This PR
- This PR with all extensions running at the top of the Lexer rather than using "before" to spread them out