Caching lookaheads for speed gains #1822
Running a rather large markdown file through the demo page and looking at the Google Inspector profiler, I noticed a lot of time spent on the paragraph regex. Given that paragraphs are probably one of the most common elements in a typical document, I wonder if we can speed this up.
The regex relies on a lot of lookaheads to make sure the paragraph isn't interrupted by various block elements. In the case where it does detect one of those elements, does it make sense to cache that result and immediately apply it as the next token? Would that even give a noticeable boost at all?
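For context, here is a minimal sketch of the kind of pattern being described (illustrative only, not marked's actual paragraph rule): the paragraph keeps consuming lines as long as a negative lookahead confirms the next line isn't an interrupter, so the interrupter is matched on every line but its text is thrown away.

```js
// Illustrative only -- not marked's real paragraph regex. Each new line is
// accepted only if the negative lookahead confirms it isn't an interrupter
// (here just a heading, a blockquote, or an hr). The lookahead work is redone
// on every line and its result is discarded, which is the cost in question.
const paragraph = /^[^\n]+(?:\n(?!#{1,6} |> |(?:-{3,}|\*{3,})(?:\n|$))[^\n]+)*/;

const src = 'line one\nline two\n# interrupting heading\n';
console.log(paragraph.exec(src)[0]); // 'line one\nline two'
```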
-
I'm not sure how you can capture a lookahead to cache it.
-
It seems you can directly capture a positive lookahead with `(?=(capture))`, but not a negative one, since by definition the text it rules out is never matched. So... could we maybe swap the negative lookaheads to positive ones? Or... just add another normal capture group? If an interrupter is found, the paragraph token is just the first part of the regex, and the next token will be the second part. If none is found, it's just a normal uninterrupted paragraph token. ...Something like this?
I can already see some potential flaws here (what if the third paragraph has an interrupter?), but maybe it's a starting point... Or... we could just limit the paragraph regex to one newline at a time, and then in the tokenizer group together any paragraph tokens that end up right next to each other.
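As a rough illustration of the capturing idea (a toy pattern under the assumption that the only interrupter is an ATX heading, not marked's actual rules): a capture group placed inside a positive lookahead lets the interrupter text be read without being consumed, so it could be pushed as the next token.

```js
// Toy sketch, assuming the only interrupter is an ATX heading. Group 1 is the
// paragraph body; group 2, captured inside a positive lookahead, is the
// interrupter (or undefined if the paragraph simply ended on its own).
const paragraph = /^([^\n]+(?:\n(?!#{1,6} )[^\n]+)*)\n*(?=(#{1,6} [^\n]*)?)/;

const match = paragraph.exec('some text\nmore text\n# interrupting heading\n');
console.log(match[1]); // 'some text\nmore text'     -> paragraph token
console.log(match[2]); // '# interrupting heading'   -> could become the next token
```

If no interrupter follows, `match[2]` is simply `undefined` and only the paragraph token would be pushed.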
-
The Lexer groups together text tokens in a similar way: https://github.com/markedjs/marked/blob/master/src/Lexer.js#L237
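For comparison, a hedged sketch of what that grouping could look like if applied to paragraph tokens (the function name and token shape here are illustrative, not marked's internals):

```js
// Merge runs of adjacent tokens of the same type, in the spirit of how the
// Lexer concatenates consecutive 'text' tokens. Purely illustrative.
function mergeAdjacent(tokens, type = 'paragraph') {
  const out = [];
  for (const token of tokens) {
    const last = out[out.length - 1];
    if (last && last.type === type && token.type === type) {
      last.raw += token.raw;          // keep the raw source contiguous
      last.text += '\n' + token.text; // join the visible text with a newline
    } else {
      out.push({ ...token });         // copy so the input array is untouched
    }
  }
  return out;
}

console.log(mergeAdjacent([
  { type: 'paragraph', raw: 'one\n', text: 'one' },
  { type: 'paragraph', raw: 'two\n', text: 'two' },
  { type: 'heading', raw: '# hi\n', text: 'hi' },
]));
// -> two tokens: a merged paragraph ('one\ntwo') and the heading
```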
-
FYI: redesigning the regex to capture the interrupters and immediately push them to the token array doesn't seem to give any noticeable speedup. https://github.com/calculuschild/marked/tree/refactorParagraphs Just for fun, I'm going to try out the other method of tokenizing paragraphs individually.