0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

RedCMD · 2022-12-13T05:41:02Z

original issue: microsoft/vscode-textmate#166

the regex (?<=ab*)c runs into catastrophic backtracking when matching against a long line of c's
however (?<=ab+|a)c does not have this issue

The text was updated successfully, but these errors were encountered:

kkos · 2022-12-13T13:21:33Z

(?<=ab+|a)c is transformed into (?:(?<=ab+)|(?<=a))c.
/ab+/ and /a/ have very good properties for look-behind.
/ab+/ has the property that it always ends with a 'b', and /a/ is a pattern with a fixed number of characters.
I think I should fix it so that (?<=ab*)c is converted to (?<=ab+|a)c.
But I don't know when it can be done as I don't have time.

RedCMD · 2022-12-13T18:38:44Z

sadly (?<=ab+|a)c runs into the same problem but with a long line of bc's instead

how does lookbehind work?
does it start at every character in the string and go forward?
or does it match the c first and go backwards?
how well can it handle multi-length regexs?

I noticed limiting the possible length of b* to like b{0,100} reduces lag quite a bit

the initial regex I had problems with was (?<={\d*)\\\\(?=,\d*})
the first half was causing heaps of lag on every single double \\ it found.

kkos · 2022-12-14T14:22:19Z

Look-behind is made as a type of anchor.
First find the location of c, then perform look-behind matching backward from that location.
Since ab* and ab+ have infinite length, if they do not match, they go back one character at a time to the beginning of the string.
If ab{0,100}, it will only go back a maximum of 101 characters.

Originally, it did not support variable length look-behind.
I recall that before making modifications to accommodate variable length, I considered whether to give special treatment only to cases where there is an look-behind at the beginning of the pattern.

/(?<=p1)p2/ ---> /p1\Kp2/

But I can't remember why I stopped this.
Either there is something wrong with this conversion and I have given up, or I have decided that it is not worth implementing so much because it only corresponds to cases where there is an look-behind at the beginning of the pattern.
I'll think again.

RedCMD · 2022-12-15T02:48:00Z

I assume you stopped because using \K still captures the text p1 in that regex
which is different to the 0width lookbehind

what optimizations are there around detecting a or b as being the only possible last characters in the lookbehind (?<=ab*)?
and thus checking if x is b or a in xc and stopping early
tho I would assume it to get very complex

I just tested with (?:(?<=ab+)|(?<=a))c instead of (?<=ab*)c
and there was no lag on a long line of a's, b's or c's. But there was still lag with bc's
ac and dc didn't lag either
Im testing this through vscodes textmate syntax highlighting

Is it possible to physically reverse the execution direction of the regex?
so (?<=abcdef) would check f first, then e etc with a last

kkos · 2022-12-15T14:00:53Z

No optimization exists for (?<=ab*).
For (?:(?<=ab+)|(?<=a))c , there is a minimal optimization that checks that b is immediately before c for the part of (?<=ab+).
I knew there was a way to match in the reverse direction even before I had to deal with variable length look-behind, and it would be possible.
It is just that I have no intention of implementing it.

RedCMD · 2024-05-31T04:30:44Z

I just ran into a similar problem again

here's a simplified regex a++b
it is very slow when attempting to match against a very long line of a's

however the regex ca++b does not have an issue
even against a long line of ca's or even caaaaaaaa...aaaaaaaa is fine

surprisingly the regex (?~|x|a+)b does not lag either
however on a long line aaaaaa...aaaaaxb, it lags tremendously

the original regex I had a problem with (?=[\x85[^-?:,\[\]{}#&*!|>'"%@` \p{Cntrl}\p{Surrogate}\x{FEFF FFFE FFFF}]]|[?:-](?![\r\n\t ,\[\]{}]))(?=(?>[^:#,\[\]{}]++|(?<! |\t)#++|:(?=[^\r\n\t ,\[\]{}]))++:[\r\n\t ,\[\]{}])
Testing this in the context of VSCode's TextMate engine

RedCMD · 2024-06-01T02:06:07Z

I just found an interesting work around
simply append |(*ERROR) to the end of the regex

however it does mean, the regex must match at the start of the string
but depending on your application, that may be desirable

possibly a more performant version of \G

EDIT:
Due to how VSCode TextMate caches its regex's
(*ERROR) won't work for me

kkos · 2024-06-05T14:53:42Z

I could not reproduce this slowing down.
/a++b/ and "aaa....." (length 3000) ==> 32msec.
Is ++ not a greedy operator?

RedCMD · 2024-06-05T23:43:54Z

sorry
I didn't specify that it wasn't catastrophic backtracking
for me a++b (possessive greedy) on 3000 a's is about 38ms
but if I place a single b at the end, it drops down to about 0.1ms; a 380x speed up

If I double the a's to 6000, the time increases to 155ms
12000 a's is 624ms

so it's N^2 time complexity

a+?b (reluctant), takes about 10% more time

I guess this doesn't matter too much; as it isn't catastrophic backtracking
(and it seems there's already some good optimizations working)

I'm just being a bit nitpicky
seeing a single regex take longer to scan a 12k char long file
than a complex grammar tokenizing a 200k char file in less than 500ms

kkos added the question label Dec 14, 2022

kkos closed this as completed Dec 22, 2022

tonco-miyazawa mentioned this issue Jun 13, 2024

add new operator \J (skip search) #278 #299

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

RedCMD commented Dec 13, 2022

kkos commented Dec 13, 2022

RedCMD commented Dec 13, 2022 •

edited

Loading

kkos commented Dec 14, 2022

RedCMD commented Dec 15, 2022 •

edited

Loading

kkos commented Dec 15, 2022

RedCMD commented May 31, 2024 •

edited

Loading

RedCMD commented Jun 1, 2024 •

edited

Loading

kkos commented Jun 5, 2024

RedCMD commented Jun 5, 2024 •

edited

Loading

0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

0-infinite quantifier inside lookbehind causes catastrophic backtracking #278

Comments

RedCMD commented Dec 13, 2022

kkos commented Dec 13, 2022

RedCMD commented Dec 13, 2022 • edited Loading

kkos commented Dec 14, 2022

RedCMD commented Dec 15, 2022 • edited Loading

kkos commented Dec 15, 2022

RedCMD commented May 31, 2024 • edited Loading

RedCMD commented Jun 1, 2024 • edited Loading

kkos commented Jun 5, 2024

RedCMD commented Jun 5, 2024 • edited Loading

RedCMD commented Dec 13, 2022 •

edited

Loading

RedCMD commented Dec 15, 2022 •

edited

Loading

RedCMD commented May 31, 2024 •

edited

Loading

RedCMD commented Jun 1, 2024 •

edited

Loading

RedCMD commented Jun 5, 2024 •

edited

Loading