Fix stack overflow in RegExp for long string #12462
This prevents hitting stack overflow errors on long string inputs. Closes apache#11537
@@ -30,6 +30,7 @@
package org.apache.lucene.util.automaton;

import java.io.IOException;
import java.util.ArrayDeque;
Are we using ArrayDeque because we expect it to be faster than a Stack?
I am using ArrayDeque because Stack is in the forbidden APIs, see message:
java.util.Stack @ Use more modern java.util.ArrayDeque as it is not synchronized
Stack is old and not preferred. Use Deque :)
Thanks, now using the Deque type.
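For readers unfamiliar with the exchange above, here is a minimal sketch of declaring a variable as Deque, backing it with ArrayDeque, and using it as a LIFO stack. The class and method names (DequeAsStack, reverse) are illustrative assumptions and do not come from the PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: DequeAsStack and reverse are hypothetical names, not part of the PR.
final class DequeAsStack {
  // Reverses the input by pushing every character and popping them back off (LIFO order).
  static String reverse(String input) {
    Deque<Character> stack = new ArrayDeque<>();
    for (char c : input.toCharArray()) {
      stack.push(c); // push inserts at the head of the deque
    }
    StringBuilder out = new StringBuilder();
    while (!stack.isEmpty()) {
      out.append(stack.pop()); // pop removes from the head of the deque
    }
    return out.toString();
  }
}
```

Unlike java.util.Stack, ArrayDeque is not synchronized, which is the reason given in the forbidden-APIs message quoted above.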
@@ -1067,22 +1068,44 @@ private boolean check(int flag) {
}

final RegExp parseUnionExp() throws IllegalArgumentException {
These three methods look similar. Is it worth trying to extract out their bodies to a common method?
This new method would take 3 functions as arguments: one to call in the do-while loop, one for the do-while condition, and one to call in the while loop after.
Thanks for the suggestion, I will try this and submit a commit with this approach.
Thank you for making the changes!
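The extraction discussed in this thread could look roughly like the sketch below: one generic helper that takes a gather function for the do-while body, a condition for the do-while, and a reduce function for the loop that follows. This is an assumption-laden toy over strings, not the PR's code; StackBasedParseSketch, gatherThenReduce, hasMore, and demo are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: a generic stand-in for three near-identical parse methods.
final class StackBasedParseSketch {
  static <T> T gatherThenReduce(
      Supplier<T> gather, BooleanSupplier hasMore, BinaryOperator<T> reduce) {
    Deque<T> stack = new ArrayDeque<>();
    // Phase 1: collect every operand iteratively instead of recursing once per operand.
    do {
      stack.push(gather.get());
    } while (hasMore.getAsBoolean());
    // Phase 2: combine from the top of the stack, so the result nests to the right,
    // matching what a recursive parser would build: a op (b op (c op d)).
    T result = stack.pop();
    while (!stack.isEmpty()) {
      result = reduce.apply(stack.pop(), result);
    }
    return result;
  }

  // Toy usage: the "operands" are strings and the reduce step writes out the grouping.
  static String demo() {
    Iterator<String> tokens = List.of("a", "b", "c", "d").iterator();
    return gatherThenReduce(tokens::next, tokens::hasNext, (l, r) -> "(" + l + "|" + r + ")");
  }
}
```

The helper uses a loop and a heap-allocated deque, so its call depth stays constant regardless of how many operands are gathered.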
final RegExp iterativeParseExp(
    Supplier<RegExp> gather, BooleanSupplier stop, MakeRegexGroup reduce)
    throws IllegalArgumentException {
  Deque<RegExp> regExpStack = new ArrayDeque<>();
Why do we even need a stack/deque?
If we need to reduce the call stack further, then I think we need a stack that is shared across function calls, plus some more rewriting.
But here I don't think we need a stack or any LIFO operations. It should just be:
- parse all the sub-components
- reduce them
So why not:
    RegExp res = null;
    do {
      RegExp e = gather.get();
      if (res == null) {
        res = e;
      } else {
        res = reduce.get(flags, res, e);
      }
    } while (stop.getAsBoolean());
I think this may alter the result a bit by changing it from a | (b | (c | d)) to ((a | b) | c) | d, but for union, intersect and concat the change in associativity shouldn't affect correctness?
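A self-contained sketch of the stack-free fold suggested in this comment follows. It is a generic toy rather than Lucene code: LeftFoldSketch, foldLeft, hasMore, grouping, and sumOfManyOnes are hypothetical names, and the two-argument BinaryOperator stands in for the three-argument reduce.get(flags, res, e) call.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of folding operands left to right with a single accumulator.
final class LeftFoldSketch {
  static <T> T foldLeft(Supplier<T> gather, BooleanSupplier hasMore, BinaryOperator<T> reduce) {
    T res = null;
    do {
      T e = gather.get();
      // The accumulator is always the left operand, so operand order is preserved.
      res = (res == null) ? e : reduce.apply(res, e);
    } while (hasMore.getAsBoolean());
    return res;
  }

  // Writes out the grouping produced by the fold; it nests to the left.
  static String grouping() {
    Iterator<String> it = List.of("a", "b", "c", "d").iterator();
    return foldLeft(it::next, it::hasNext, (l, r) -> "(" + l + "|" + r + ")");
  }

  // Folds n operands with constant call depth and no intermediate collection.
  static long sumOfManyOnes(int n) {
    int[] remaining = {n};
    return LeftFoldSketch.<Long>foldLeft(
        () -> {
          remaining[0]--;
          return 1L;
        },
        () -> remaining[0] > 0,
        Long::sum);
  }
}
```

The grouping differs from the recursive form, which is harmless only when the reduce operation is associative, as the comment above notes for union, intersection, and concatenation.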
Good point! In these cases, reduce is an associative function. If we do this, maybe we can rename it to associativeReduce or something similar to make this extra assumption obvious.
Thanks for the suggestion, Patrick and Stefan.
As long as we do the suggested reduce.get(flags, res, e) instead of reduce.get(flags, e, res), the result should be correct.
Will rename reduce to highlight its associativity.
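The point about argument order can be illustrated with string concatenation, which is associative but not commutative. The sketch below is a toy under that assumption; OperandOrderSketch, fold, and accumulatorFirst are hypothetical names, not identifiers from the PR.

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical sketch: the same fold with the accumulator passed first or second.
final class OperandOrderSketch {
  static String fold(List<String> parts, BinaryOperator<String> reduce, boolean accumulatorFirst) {
    String res = null;
    for (String e : parts) {
      if (res == null) {
        res = e;
      } else if (accumulatorFirst) {
        res = reduce.apply(res, e); // keeps the operands in their original order
      } else {
        res = reduce.apply(e, res); // reverses the operands
      }
    }
    return res;
  }
}
```

With List.of("a", "b", "c") and String::concat, passing the accumulator first yields "abc", while passing it second yields "cba". Associativity permits regrouping but not reordering, which is why the accumulator has to stay on the left.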
Added a new commit with the changes.
LGTM
@slow-J Please add an entry to CHANGES.txt, I'll merge and backport this one then :)
Thank you!
Merged and backported, thank you @slow-J!
Parsing regexps no longer raises stack overflows thanks to apache/lucene#12462.
Description
Removed recursion from the parseUnionExp, parseInterExp and parseConcatExp methods. This prevents hitting stack overflow errors on long string inputs. Added a unit test to demonstrate.
Closes #11537