
Tokenizer #1637

Merged

UziTech merged 15 commits into markedjs:master from the tokenizer branch on Apr 16, 2020

Conversation

@UziTech (Member) commented Apr 3, 2020

Description

This PR builds on #1627 to create a Tokenizer class that can be extended to influence the Lexer.

TODO:

  • Write unit tests
  • Make benchmarks faster (currently same as master)
  • Update docs

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

  • Draft GitHub release notes have been updated.
  • CI is green (no forced merge required).
  • Merge PR

@vercel (bot) commented Apr 3, 2020

This pull request is being automatically deployed with ZEIT Now.
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://zeit.co/markedjs/markedjs/gnsdjmlxq
✅ Preview: https://markedjs-git-fork-uzitech-tokenizer.markedjs.now.sh

@calculuschild (Contributor) commented Apr 3, 2020

Nice! Looks like you got it going. I think I'll go ahead and close #1632 then.

@UziTech (Member, Author) commented Apr 3, 2020

This should probably be cleaned up. Right now the Tokenizer functions get passed an object with src and out properties that need to be edited in addition to returning a token. It seems like a weird way to make the user edit the src and out.
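
A rough sketch of the interface being described, purely as illustration (the state shape and rule here are assumptions, not marked's actual code): the tokenizer function both returns a token and mutates the state it was given.

```js
// hypothetical tokenizer function under the current src/out interface
const codeRule = /^( {4}[^\n]+\n*)+/; // simplified indented-code rule

const tokenizer = {
  code(state) {
    const cap = codeRule.exec(state.src);
    if (!cap) return;
    state.src = state.src.substring(cap[0].length); // side effect: consume the matched text
    state.out += cap[0];                            // side effect: append to the inline "out" text
    return {
      type: 'code',
      raw: cap[0],
      text: cap[0].replace(/^ {4}/gm, '').trimEnd()
    };
  }
};

const state = { src: '    console.log(1);\nnext paragraph', out: '' };
const token = tokenizer.code(state);
// token is the code token; state.src and state.out were edited as side effects
```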

@calculuschild (Contributor) commented:

What does the out property do?

@UziTech (Member, Author) commented Apr 3, 2020

It provides the text for inline tokens.

@UziTech (Member, Author) commented Apr 3, 2020

After some refactoring it looks like we could remove the out property if we don't care to have the text for inline tokens that can have other inline tokens inside of them (strong, em, and del).

@calculuschild (Contributor) commented Apr 4, 2020

It seems like a weird way to make the user edit the src and out.

I would agree that the Tokenizer functions should probably just focus on generating a token. Making a user also handle all these side effects seems like it would make extension more convoluted than it needs to be.

In my attempt #1632 I also removed editing the src from the Token functions. I imagine the same could be done here, so src is only really needed to do the initial regex capture and then if a token is returned the lexer will update src appropriately.

@UziTech (Member, Author) commented Apr 5, 2020

Yes, but I think the original regex capture is important to be able to extend/change. And how would we deal with backtracking? If we want to be 100% spec compliant I think we will need to do more of that.

@calculuschild (Contributor) commented:

I think the original regex capture is important to be able to extend/change.

Right, performing the regex capture makes sense, so we pass in src for that, but I don't think it's necessary to edit src in the tokenizer. Then again, isn't that why the rules are already extendable?

And how would we deal with backtracking?

In #1632 I found ways around the backtracking in the list token. If that's not enough, perhaps we can simply use the length of the "raw" text in the tokens and consume that much of src.

@UziTech (Member, Author) commented Apr 5, 2020

I like the idea of using raw length 👍. We will still need to pass the src to the tokenizer functions but they won't need to update it.
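
A minimal sketch of the raw-length approach, assuming a lexer loop roughly like the one discussed here (the tokenizer method names are illustrative): the tokenizer only reads src and returns a token carrying raw, and the lexer consumes exactly raw.length characters.

```js
// illustrative lexer loop; not marked's exact implementation
function lex(src, tokenizer) {
  const tokens = [];
  let token;
  while (src) {
    if (token = tokenizer.space(src)) {        // e.g. blank lines
      src = src.substring(token.raw.length);   // consume exactly what the token covered
      tokens.push(token);
      continue;
    }
    if (token = tokenizer.paragraph(src)) {
      src = src.substring(token.raw.length);
      tokens.push(token);
      continue;
    }
    throw new Error('Infinite loop on byte: ' + src.charCodeAt(0));
  }
  return tokens;
}
```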

@UziTech (Member, Author) commented Apr 6, 2020

I think this is just about ready. A few things to point out:

  1. The block text function is called text and the inline text function is called inlineText.
  2. I left smartypants and mangle as public functions so the user could extend them if they want.
  3. Some tokens (em, strong, del, link) don't have a text property since they have inline tokens. We could add a text property from the tokens property but I don't think it is necessary.
  4. I am passing src and other variables by value instead of in an object, but I didn't see any slowdown in the benchmarks.
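
As a point of reference, here is a hedged sketch of how a user would override a single Tokenizer function with this PR, based on the `{ tokenizer }` option shown in the docs example later in this thread (the method name and signature are assumptions, not the exact final API):

```js
const marked = require('marked');

const tokenizer = new marked.Tokenizer();
const defaultCodespan = tokenizer.codespan;
tokenizer.codespan = function(src) {
  // custom handling could go here; otherwise fall back to the default behaviour
  return defaultCodespan.call(this, src);
};

console.log(marked('text with `code` in it', { tokenizer }));
```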

@calculuschild (Contributor) commented:

I am passing src and other variables by value instead of in an object, but I didn't see any slowdown in the benchmarks.

My understanding is objects, strings, and arrays all default to pass by reference in Javascript anyway, so really top is the only one that would make a difference in speed, no?

@UziTech (Member, Author) commented Apr 6, 2020

My understanding is objects, strings, and arrays all default to pass by reference in Javascript anyway, so really top is the only one that would make a difference in speed, no?

Correct.

@calculuschild (Contributor) commented:

Cool. This is awesome work!

@UziTech (Member, Author) commented Apr 6, 2020

The other thing I would like to do for extensibility is to allow multiple extensions to alter different parts of the renderer and tokenizer. For example, if one extension only needs to alter code and another only affects html, they won't override each other.

I am thinking of something like this interface:

```js
// marked_code_extension
module.exports = {
	tokenizer: {
		code(lexer, src, tokens, top) {
			// code tokenizer
		}
	},
	renderer: {
		code(code, infostring, escaped) {
			// code renderer
		}
	}
};

// marked_html_extension
module.exports = {
	tokenizer: {
		html(lexer, src, tokens, top) {
			// html tokenizer
		}
	},
	renderer: {
		html(html) {
			// html renderer
		}
	}
};
```

```js
const marked = require('marked');
marked.use(require('marked_code_extension'));
marked.use(require('marked_html_extension'));

const html = marked(markdown);
// code and html extensions will both be used
```

UziTech requested a review from styfle on April 6, 2020
@calculuschild (Contributor) commented:

I am thinking of something like this interface:

This would be handy. Would this maybe fit in a separate PR?

@UziTech (Member, Author) commented Apr 6, 2020

This would be handy. Would this maybe fit in a separate PR?

Yes, I would do a separate PR for this, but I wanted to get the idea out there before I start updating the docs.

@joshbruce (Member) left a comment

I appreciate where this is going and would like to be able to make more substantive reviews. Looking at it, with limited in-depth knowledge and wanting to make sure where we're heading, I see four major parts:

  1. Parser: Breaks apart the plain text into constituent pieces (content components).
  2. Lexer: Holds the rules to be applied to the parts.
  3. Tokenizer (??): Holds the results of the tokens.
  4. Renderer: Applies the lexer rules to the parsed components.

I'm confused on some of the divisions and why they exist.

@UziTech (Member, Author) commented Apr 14, 2020

@joshbruce

  1. Lexer takes the markdown and sends it to the Tokenizer functions.
  2. Tokenizer uses rules to create tokens.
  3. Parser takes tokens and sends them to the Renderer functions.
  4. Renderer returns html for output.
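
In code, that flow looks roughly like this (using marked's exported lexer and parser entry points; a simplification, not the internals):

```js
const marked = require('marked');

const tokens = marked.lexer('# heading\n\nsome *text*'); // Lexer drives the Tokenizer to produce tokens
const html = marked.parser(tokens);                      // Parser drives the Renderer to produce HTML

console.log(html);
```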

@calculuschild (Contributor) commented Apr 14, 2020

I think this is a good compromise. With this, users can

  1. change the rules for capturing text (via extending rules) (is this possible?)
  2. change what that captured text means (via extending tokenizer)
  3. change what output is generated (via extending renderer)

This seems like a pretty reasonable (and more-or-less complete) amount of control while keeping functionality compartmentalized.

@UziTech (Member, Author) commented Apr 14, 2020

change the rules for capturing text (via extending rules) (is this possible?)

They are (not very easily) extendable by monkey-patching them from the Lexer currently:

```js
marked.Lexer.rules.block.html = // change default html rule
```

I would like to do a PR after this PR gets merged to make extending the rules, Renderer, and Tokenizer much easier. Something like:

```js
marked.use({
	rules: {
		block: {
			html: // change default html rule
		}
	},
	tokenizer: {
		html: // change html function in tokenizer
	},
	renderer: {
		html: // change html function in renderer
	}
});
```

That would allow extensions to just return that object so users could use marked with extensions like:

```js
marked.use(require("some_marked_extension"));
```

(The review comments below were made on the LaTeX codespan example added to docs/USING_PRO.md in this PR; the visible tail of that diff is:)

```js
};

// Run marked
console.log(marked('$ latex code $', { tokenizer }));
```
Member (review comment):

Nice example!

Member (review comment):

Can we also achieve a dynamic TOC based on headings using a custom tokenizer?

Member Author (review comment):

It seems like it would be easier to use the Renderer for that (example).
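
For illustration, a renderer-based TOC could look something like the sketch below (not the linked example; the slug logic here is simplified):

```js
const marked = require('marked');

const toc = [];
const renderer = new marked.Renderer();
renderer.heading = function(text, level, raw) {
  const id = raw.toLowerCase().replace(/[^\w]+/g, '-');
  toc.push({ level, text, id });
  return '<h' + level + ' id="' + id + '">' + text + '</h' + level + '>\n';
};

const html = marked('# One\n\n## Two', { renderer });
// toc now holds one entry per heading and can be rendered separately
```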

Member (review comment):

Oh yeah. So then couldn't this LaTeX example also be achieved using the Renderer instead of the Tokenizer?

Member Author (review comment):

No, because the dollar sign ($) isn't a valid code token starter normally, so it would just be counted as text.

I'm sure the example code would fail on edge cases, but it is more about showing how to extend the Tokenizer than creating a robust LaTeX interpreter.
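
For readers without the diff, here is the general shape of the example under discussion, reconstructed as a sketch (the authoritative version is the one in docs/USING_PRO.md):

```js
const marked = require('marked');

// capture $...$ as a codespan instead of letting it fall through as plain text
const tokenizer = new marked.Tokenizer();
const defaultCodespan = tokenizer.codespan;
tokenizer.codespan = function(src) {
  const match = src.match(/^\$+([^$\n]+?)\$+/);
  if (match) {
    return {
      type: 'codespan',
      raw: match[0],
      text: match[1].trim()
    };
  }
  return defaultCodespan.call(this, src); // otherwise use the original codespan tokenizer
};

// Run marked
console.log(marked('$ latex code $', { tokenizer }));
```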

Review thread on docs/USING_PRO.md (outdated, resolved)
Co-Authored-By: Steven <steven@ceriously.com>
@calculuschild (Contributor) commented:

These lines still mention rules as part of the Lexer. They are now only accessible from the Tokenizer, no? (GitHub is not letting me comment directly on those lines...)

@UziTech (Member, Author) commented Apr 15, 2020

Nice catch. Yes, the rules are set on the Tokenizer based on the options, and the Lexer has a static rules property where the user can access all of the rules.
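
A small illustration of the access point described above (assuming the property names used in this PR):

```js
const { Lexer } = require('marked');

// all of the default rules are reachable through the Lexer's static rules property
console.log(Lexer.rules.block.html);
console.log(Lexer.rules.inline.codespan);
```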

@styfle (Member) left a comment

🥳

@UziTech (Member, Author) commented Apr 16, 2020

@calculuschild have you had a chance to review this PR? I would like your approval before merging it since you had the original idea for the Tokenizer. Does this provide the fix for your use case?

@calculuschild (Contributor) commented:

@UziTech Yes, this should work just fine for what I need. Thank you.

UziTech merged commit 904c974 into markedjs:master on Apr 16, 2020
UziTech deleted the tokenizer branch on April 16, 2020
UziTech mentioned this pull request on Apr 20, 2020
zhenalexfan pushed a commit to zhenalexfan/MarkdownHan that referenced this pull request Nov 8, 2021