Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0-infinite quantifier inside look behind capture group causes tremendous performance loss #166

Closed
RedCMD opened this issue Jan 10, 2022 · 5 comments

Comments

@RedCMD
Copy link

RedCMD commented Jan 10, 2022

Creating a syntax highlighter with the following code and running it on a single line file filled with thousands of the letter c will cause tremendous lag
image

{
	"name": "Lag Syntax",
	"scopeName": "source.redcmd.syntax.lag",
	"patterns": [
		{
			"match": "(?<=ab*)c",
			"name": "strong keyword.control"
		}
	]
}

Replacing the * with a + or changing the regex to (?<=ab+|a)c will remove all lag and allow my machine to easily interpret a million character single line file
But then a file filled with bc will lag instead

I cant seem to figure out how the engine matches lookbehinds
does it start with the c first then try to match the lookbehind afterwards?
or does it try to match the lookbehind first, then the c?

using {0,10000} instead of * reduces lag massively
which seems to suggest that the engine is trying to match b an infinite amount (or an internal limit) of times before giving up
but for some reason it only does so when its allowed to match 0 times??
maybe its an underflow error?

But then using the regex (?<=a(b+)?)c also causes an enormous amount of lag
which I don't get
unless.. (b+)? gets optmized to b*

or maybe cause the engine backtracks in the wrong direction and ends up endlessly rechecking b?

Noticed single line files were only getting partially highlighted
image
Dug around and found out it was caused by this regex:
image
(?<={\\d*)\\\\{2} is the part that was going over board on every single \\

@jeff-hykin
Copy link

This might be causing some of the serious performance issues in the C++ syntax as well.

Part of our preprocessing steps in the C++ grammar optimize (b+)? to become b* so that's probably not doing us any favors

@RedCMD
Copy link
Author

RedCMD commented Jan 11, 2022

I just looked through all c, cpp, cpp.embed and cuda-cpp grammar files and did not see this specfic issue once.

@jeff-hykin
Copy link

Oh, well thanks for saving me the trouble @RedCMD. I suppose it's just regular catastrophic backtracking then for C++

@RedCMD
Copy link
Author

RedCMD commented Jan 25, 2022

@jeff-hykin
This may be causing a lot of the C++ performance issues
I notice many "#inline_comment" includes under captures
Capturing and applying a pattern causes performance loss: #167

@alexdima
Copy link
Member

Thank you for reaching out! I think the regex (?<=ab*)c might be running into catastrophic backtracking. Unfortunately, I don't know enough about the inner workings of a regex engine to help deeper than that. We are using oniguruma as the underlying regular expression engine. The author of oniguruma has been incredibly helpful and responsive in the past.

@kkos do you think there are optimization opportunities in oniguruma around such regular expressions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants