Tokenizer #1637
Conversation
Nice! Looks like you got it going. I think I'll go ahead and close #1632 then.
This should probably be cleaned up. Right now the Tokenizer functions get passed an object that has a …
What does the …?
It provides the text for inline tokens.
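For illustration, a hypothetical block-level token might look like the sketch below; the field names are assumptions based on this thread, but the idea is that `text` is what the inline pass consumes later:

```js
// Illustrative only: a block-level token whose `text` field is what the
// inline tokenizer processes later (field names assumed from this thread).
const paragraphToken = {
  type: 'paragraph',
  raw: 'some *emphasis* here\n\n',
  text: 'some *emphasis* here'
};
```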
After some refactoring it looks like we could remove the …
I would agree that the Tokenizer functions should probably just focus on generating a token. Making a user also handle all these side effects seems like it would make extension more convoluted than it needs to be. In my attempt #1632 I also removed editing the …
Yes, but I think the original regex capture is important to be able to extend/change. And how would we deal with backtracking? If we want to be 100% spec compliant I think we will need to do more of that.
Right, performing the regex capture makes sense, so we pass in …
In #1632 I found ways around the backtracking in the …
I like the idea of using raw length 👍. We will still need to pass the …
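A minimal sketch of the raw-length idea, assuming a tokenizer that only builds tokens and a Lexer that consumes input by `raw.length` (the names and shapes here are illustrative, not the final API):

```js
// Hypothetical tokenizer: match indented code and return a token whose
// `raw` field carries the full regex capture.
const tokenizer = {
  code(src) {
    const cap = /^( {4}[^\n]+(\n|$))+/.exec(src);
    if (cap) {
      return {
        type: 'code',
        raw: cap[0], // full match; the Lexer advances by raw.length
        text: cap[0].replace(/^ {4}/gm, '').trimEnd()
      };
    }
    // no return value means "no match, try the next rule"
  }
};

// Inside the Lexer, roughly:
// const token = this.tokenizer.code(src);
// if (token) {
//   src = src.substring(token.raw.length); // consume by raw length
//   tokens.push(token);
// }
```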
I think this is just about ready. A few things to point out: …
My understanding is objects, strings, and arrays all default to pass by reference in JavaScript anyway, so really …
Correct.
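A quick illustration of why that works for the `tokens` array, plus the one caveat: strings are immutable primitives, so a tokenizer cannot advance `src` in place, which is why the raw-length trick above is needed:

```js
// Mutating an array argument is visible to the caller, so a tokenizer
// can push onto `tokens` without returning it.
function addToken(tokens) {
  tokens.push({ type: 'space' });
}
const tokens = [];
addToken(tokens);
console.log(tokens.length); // 1

// Reassigning a string parameter is NOT visible to the caller, so the
// Lexer still has to advance `src` itself.
function consume(src) {
  src = src.substring(4);
}
let src = 'abcdefgh';
consume(src);
console.log(src); // 'abcdefgh' (unchanged)
```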
Cool. This is awesome work!
The other thing I would like to do for extensibility is to allow multiple extensions to alter different parts of the renderer and tokenizer. For example, one extension might only need to alter a single part. I am thinking of something like this interface:

```js
// marked_code_extension
module.exports = {
  tokenizer: {
    code(lexer, src, tokens, top) {
      // code tokenizer
    }
  },
  renderer: {
    code(code, infostring, escaped) {
      // code renderer
    }
  }
};
```

```js
// marked_html_extension
module.exports = {
  tokenizer: {
    html(lexer, src, tokens, top) {
      // html tokenizer
    }
  },
  renderer: {
    html(html) {
      // html renderer
    }
  }
};
```

```js
const marked = require('marked');
marked.use(require('marked_code_extension'));
marked.use(require('marked_html_extension'));
const html = marked(markdown);
// code and html extensions will both be used
```
This would be handy. Would this maybe fit in a separate PR?
Yes, I would do a separate PR for this, but I wanted to get the idea out there before I start updating the docs.
I appreciate where this is going and would like to be able to make more substantive reviews. Looking at it, with limited in-depth knowledge and wanting to make sure where we're heading, I see four major parts:
- Parser: Breaks apart the plain text into constituent pieces (content components).
- Lexer: Holds the rules to be applied to the parts.
- Tokenizer (??): Holds the results of the tokens.
- Renderer: Applies the lexer rules to the parsed components.
I'm confused on some of the divisions and why they exist.
I think this is a good compromise. With this, users can …
This seems like a pretty reasonable (and more-or-less complete) amount of control while keeping functionality compartmentalized.
They are (not very easily) extendable by monkey patching them from the Lexer currently:

```js
marked.Lexer.rules.block.html = // change default html rule
```

I would like to do a PR after this PR gets merged to make extending the rules, Renderer, and Tokenizer much easier. Something like:

```js
marked.use({
  rules: {
    block: {
      html: // change default html rule
    }
  },
  tokenizer: {
    html: // change html function in tokenizer
  },
  renderer: {
    html: // change html function in renderer
  }
});
```

That would allow extensions to just return that object so users could use marked with extensions like:

```js
marked.use(require("some_marked_extension"));
```
```js
};

// Run marked
console.log(marked('$ latex code $', { tokenizer }));
```
Nice example!
Can we also achieve a dynamic TOC based on headings using a custom tokenizer?
It seems like it would be easier to use the Renderer for that (example).
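Not the linked example verbatim, but a minimal sketch of the Renderer approach, assuming the current `heading(text, level, raw, slugger)` signature:

```js
const marked = require('marked');

const toc = [];
const renderer = new marked.Renderer();
const defaultHeading = renderer.heading.bind(renderer);

renderer.heading = function (text, level, raw, slugger) {
  toc.push({ text, level }); // collect entries while rendering
  return defaultHeading(text, level, raw, slugger);
};

const html = marked('# Title\n## Section', { renderer });
console.log(toc); // [ { text: 'Title', level: 1 }, { text: 'Section', level: 2 } ]
```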
Oh yeah. So then couldn't this LaTeX example also be achieved using the Renderer instead of Tokenizer?
No, because the dollar sign (`$`) isn't a valid code token starter normally, so it would just be counted as text.
I'm sure the example code would fail on edge cases but it is more about showing how to extend the Tokenizer than create a robust LaTeX interpreter.
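For reference, a minimal sketch of the kind of override in that example, assuming a `codespan` tokenizer method that returns a token (the exact signature may differ from what this PR merges):

```js
const marked = require('marked');

// Treat $...$ as a code span so a later step can hand it to a LaTeX
// renderer; the method name and return shape here are assumptions,
// not the guaranteed final API.
const tokenizer = {
  codespan(src) {
    const match = /^\$+([^$\n]+?)\$+/.exec(src);
    if (match) {
      return {
        type: 'codespan',
        raw: match[0],
        text: match[1].trim()
      };
    }
  }
};

console.log(marked('$ latex code $', { tokenizer }));
```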
Co-Authored-By: Steven <steven@ceriously.com>
These lines still mentioned rules as part of the Lexer. They are now only accessible from the Tokenizer, no? (GitHub is not letting me comment directly on those lines...)
Nice catch. Yes, the rules are set on the tokenizer based on the options, and the Lexer has a static property …
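So something like this should still work for inspecting the defaults (property name per the comment above; treat the exact shape as an assumption):

```js
const marked = require('marked');

// The default rules remain reachable through the static property on
// the Lexer, even though instances now read them from the Tokenizer.
const { block, inline } = marked.Lexer.rules;
console.log(block.html instanceof RegExp); // true, assuming this shape
```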
🥳
@calculuschild have you had a chance to review this PR? I would like your approval before merging it since you had the original idea for the Tokenizer. Does this provide the fix for your use case?
@UziTech Yes, this should work just fine for what I need. Thank you. |
Description
This PR builds on #1627 to create a Tokenizer class that can be extended to influence the Lexer.
TODO: …
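A minimal sketch of the intended extension point, with names assumed from the discussion above rather than the merged API:

```js
const marked = require('marked');

// Start from the default Tokenizer and wrap one method.
const tokenizer = new marked.Tokenizer();
const defaultCode = tokenizer.code.bind(tokenizer);

tokenizer.code = function (...args) {
  // inspect or transform the input, then fall back to the default
  return defaultCode(...args);
};

console.log(marked('    indented code', { tokenizer }));
```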