Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

Closed
RedCMD opened this issue Dec 13, 2022 · 9 comments
Closed
Labels

Comments

@RedCMD
Copy link

RedCMD commented Dec 13, 2022

original issue: microsoft/vscode-textmate#166

the regex (?<=ab*)c runs into catastrophic backtracking when matching against a long line of c's
however (?<=ab+|a)c does not have this issue

@kkos
Copy link
Owner

kkos commented Dec 13, 2022

(?<=ab+|a)c is transformed into (?:(?<=ab+)|(?<=a))c.
/ab+/ and /a/ have very good properties for look-behind.
/ab+/ has the property that it always ends with a 'b', and /a/ is a pattern with a fixed number of characters.
I think I should fix it so that (?<=ab*)c is converted to (?<=ab+|a)c.
But I don't know when it can be done as I don't have time.

@RedCMD
Copy link
Author

RedCMD commented Dec 13, 2022

sadly (?<=ab+|a)c runs into the same problem but with a long line of bc's instead

how does lookbehind work?
does it start at every character in the string and go forward?
or does it match the c first and go backwards?
how well can it handle multi-length regexs?

I noticed limiting the possible length of b* to like b{0,100} reduces lag quite a bit

the initial regex I had problems with was (?<={\d*)\\\\(?=,\d*})
the first half was causing heaps of lag on every single double \\ it found.

@kkos
Copy link
Owner

kkos commented Dec 14, 2022

Look-behind is made as a type of anchor.
First find the location of c, then perform look-behind matching backward from that location.
Since ab* and ab+ have infinite length, if they do not match, they go back one character at a time to the beginning of the string.
If ab{0,100}, it will only go back a maximum of 101 characters.

Originally, it did not support variable length look-behind.
I recall that before making modifications to accommodate variable length, I considered whether to give special treatment only to cases where there is an look-behind at the beginning of the pattern.

/(?<=p1)p2/ ---> /p1\Kp2/ 

But I can't remember why I stopped this.
Either there is something wrong with this conversion and I have given up, or I have decided that it is not worth implementing so much because it only corresponds to cases where there is an look-behind at the beginning of the pattern.
I'll think again.

@kkos kkos added the question label Dec 14, 2022
@RedCMD
Copy link
Author

RedCMD commented Dec 15, 2022

I assume you stopped because using \K still captures the text p1 in that regex
which is different to the 0width lookbehind

what optimizations are there around detecting a or b as being the only possible last characters in the lookbehind (?<=ab*)?
and thus checking if x is b or a in xc and stopping early
tho I would assume it to get very complex

I just tested with (?:(?<=ab+)|(?<=a))c instead of (?<=ab*)c
and there was no lag on a long line of a's, b's or c's. But there was still lag with bc's
ac and dc didn't lag either
Im testing this through vscodes textmate syntax highlighting

Is it possible to physically reverse the execution direction of the regex?
so (?<=abcdef) would check f first, then e etc with a last

@kkos
Copy link
Owner

kkos commented Dec 15, 2022

No optimization exists for (?<=ab*).
For (?:(?<=ab+)|(?<=a))c , there is a minimal optimization that checks that b is immediately before c for the part of (?<=ab+).
I knew there was a way to match in the reverse direction even before I had to deal with variable length look-behind, and it would be possible.
It is just that I have no intention of implementing it.

@kkos kkos closed this as completed Dec 22, 2022
@RedCMD
Copy link
Author

RedCMD commented May 31, 2024

I just ran into a similar problem again

here's a simplified regex a++b
it is very slow when attempting to match against a very long line of a's

however the regex ca++b does not have an issue
even against a long line of ca's or even caaaaaaaa...aaaaaaaa is fine

surprisingly the regex (?~|x|a+)b does not lag either
however on a long line aaaaaa...aaaaaxb, it lags tremendously

the original regex I had a problem with (?=[\x85[^-?:,\[\]{}#&*!|>'"%@` \p{Cntrl}\p{Surrogate}\x{FEFF FFFE FFFF}]]|[?:-](?![\r\n\t ,\[\]{}]))(?=(?>[^:#,\[\]{}]++|(?<! |\t)#++|:(?=[^\r\n\t ,\[\]{}]))++:[\r\n\t ,\[\]{}])
Testing this in the context of VSCode's TextMate engine

@RedCMD
Copy link
Author

RedCMD commented Jun 1, 2024

I just found an interesting work around
simply append |(*ERROR) to the end of the regex

however it does mean, the regex must match at the start of the string
but depending on your application, that may be desirable

possibly a more performant version of \G

EDIT:
Due to how VSCode TextMate caches its regex's
(*ERROR) won't work for me

@kkos
Copy link
Owner

kkos commented Jun 5, 2024

I could not reproduce this slowing down.
/a++b/ and "aaa....." (length 3000) ==> 32msec.
Is ++ not a greedy operator?

@RedCMD
Copy link
Author

RedCMD commented Jun 5, 2024

sorry
I didn't specify that it wasn't catastrophic backtracking
for me a++b (possessive greedy) on 3000 a's is about 38ms
but if I place a single b at the end, it drops down to about 0.1ms; a 380x speed up

If I double the a's to 6000, the time increases to 155ms
12000 a's is 624ms

so it's N^2 time complexity

a+?b (reluctant), takes about 10% more time

I guess this doesn't matter too much; as it isn't catastrophic backtracking
(and it seems there's already some good optimizations working)

I'm just being a bit nitpicky
seeing a single regex take longer to scan a 12k char long file
than a complex grammar tokenizing a 200k char file in less than 500ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants