Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715
Comments
Seems like most of the problem is caused by this: I don't know if there is any other replacement that wouldn't explode the NFA internally...
Right... I think we're experiencing this exactly.
Sorry for the silence, was mostly offline last week. I didn't directly spot any negation operator ( Basically, the NFA->DFA transformation is always potentially exponential, and the negation operator includes one. There are cases that must be exponential, because even the minimal DFA contains exponentially more states than the NFA, and there are cases where it's "just" exponential in the construction. 2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible. One thing that is notable in ASCIITLD.jflex-macro is that it's a very large One thing to try would be to factor out common prefixes, e.g.
=
This should reduce the number of DFA states in initial construction, because the first character in the alternative already distinguishes which one it is. It basically does something that DFA minimisation would take care of later. The spec is big, of course, and manually transforming it will be error-prone, so before even attempting it, it would be worth checking whether it helps enough. The other question is whether, if there is a negation operator on the big expression, that operator could somehow be avoided.
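To make the common-prefix idea concrete, here is a minimal sketch that factors a flat alternation of literal words through a trie, so each shared prefix appears only once. It is illustrative only: the word list is a made-up sample rather than the contents of ASCIITLD.jflex-macro, the helper names `build_trie` and `trie_to_regex` are hypothetical, and the output uses Python `re` syntax, not JFlex syntax. Whether the same rewrite actually reduces JFlex's construction cost would still need to be measured.

```python
import re


def build_trie(words):
    """Insert each word into a nested-dict trie; the key '' marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Render a trie as a regex in which every shared prefix is written once."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word ends at this node, everything below it is optional.
    return group + "?" if ends_here else group


if __name__ == "__main__":
    # Hypothetical sample; a real TLD list is far longer.
    sample = ["com", "coop", "co", "net", "name"]
    print(trie_to_regex(build_trie(sample)))
```

After factoring, the first character of each branch already decides which alternative is being followed, which is the property described in the comment above.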
Just had a look at Are you able to try out the master branch?
I am on a short holiday this week; thanks for looking into it. I will get back to this once I am back. D.
No worries. If the new feature helps, we can probably push out the JFlex 1.8.0 release next week, so you can have the build on an official version.
Hi Gerwin. Nah, I observe the same behavior on master, sadly. Compilation takes forever and requires huge amounts of RAM.
Yes, that's my impression too. I bet something could be optimized along the way (maybe at the cost of algorithm clarity) to reduce the number of epsilon transitions before the conversion to a DFA takes place. I wish I had more time to help out! Tweaking ASCIITLD.jflex-macro is certainly possible, although it'll make the definition very complex and hard to understand. As it stands now you can tell what it's doing; I think the compiler should implement this state reduction internally. This shouldn't be too difficult, even with a hash-based state deduplication? It's easy for me to say, of course. :)
Alright, I'll look into it more deeply; surely there must be something we can do. It'll be a bit, though: next week (hopefully) is for the 1.8.0 release first.
No problem at all - this isn't a very pressing matter as we regenerate those automata very infrequently. I was just surprised to see such a vast disproportion in memory consumption between generation and the final automaton. The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container where you can detect nodes with the same input/output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily so it may not be a trivial change...
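A minimal sketch of the hash-based deduplication idea, under loud assumptions: the automaton is a plain dict of dicts rather than JFlex's bitset-based structures, every state (including ones with no outgoing edges) has an entry in `transitions`, and the names `find` and `dedupe_states` are hypothetical. States are merged when their accepting flag and outgoing transitions are literally identical, and the pass repeats until nothing changes. This is weaker than full DFA minimisation (for example, two equivalent states that each loop back to themselves are not merged), so it only illustrates the associative-container trick.

```python
def find(rep, state):
    """Follow representative links until reaching a state that represents itself."""
    while rep[state] != state:
        state = rep[state]
    return state


def dedupe_states(transitions, accepting):
    """Merge states with identical (accepting flag, outgoing transitions).

    transitions: dict state -> dict symbol -> target state
    accepting:   set of accepting states
    Returns (merged_transitions, merged_accepting).
    """
    rep = {s: s for s in transitions}  # state -> representative
    changed = True
    while changed:
        changed = False
        seen = {}  # signature -> representative state
        for state in sorted(transitions):
            if rep[state] != state:
                continue  # already merged into another state
            signature = (
                state in accepting,
                frozenset(
                    (sym, find(rep, dst))
                    for sym, dst in transitions[state].items()
                ),
            )
            if signature in seen:
                rep[state] = seen[signature]
                changed = True  # targets may now coincide; run another pass
            else:
                seen[signature] = state
    merged = {
        s: {sym: find(rep, dst) for sym, dst in out.items()}
        for s, out in transitions.items()
        if rep[s] == s
    }
    return merged, {s for s in accepting if rep[s] == s}
```

For a toy automaton accepting "ab" and "cb" through separate branches, the two accepting states merge in the first pass and the two states leading to them merge in the second, leaving three states instead of five.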
Ok, good to know that you're not stuck because of it.
Yes, that might be a bit of a project. I think I'll try extracting common factors on the regexp AST first, and see if that has an impact.
@lsf37 I finally tried out the new feature (macros in charclasses) and it's definitely a significant improvement! 285.26 seconds versus 918.87 seconds. So this NFA-complement that was happening due to the workaround was at least amplifying the problem significantly for us. Progress. Thanks!
Closing this as the macro-in-char-classes feature seems to have alleviated the problem, and opened #1026 as a tracking issue for the potential common-prefix optimisation mentioned in this issue.
Apache Lucene has a JFlex definition file (UAX29URLEmailTokenizerImpl.jflex) that is 21 KB and includes some other JFlex files that add ~40 KB. The generated file (DFA) is 2.2 MB, so it's still relatively small. The generation process takes a whopping 10 minutes and requires 10 gigs of RAM, though; most of it is spent constructing the NFA and computing epsilon closures.
I thought it'd be interesting to look into why it behaves this way. Perhaps it'd provide some clues for optimization.
You can reproduce the construction in the Lucene repository (https://github.com/apache/lucene-solr/)
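For readers unfamiliar with the two steps named in the description, here is a small textbook-style sketch of epsilon closure and subset construction. It is not JFlex's implementation; the dict-based NFA encoding, the `EPS` marker, and all function names are assumptions made for illustration. The helper `kth_from_end_nfa` builds the classic NFA for "the k-th symbol from the end is `a`", a standard case where the DFA needs exponentially more states than the NFA, matching the remark in the comments that some blow-ups are unavoidable.

```python
from collections import deque

EPS = None  # label used in this sketch for epsilon transitions


def epsilon_closure(nfa, states):
    """All states reachable from `states` using epsilon transitions only."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, alphabet):
    """Return DFA transitions keyed by frozensets of NFA states."""
    dfa = {}
    queue = deque([epsilon_closure(nfa, {start})])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue  # already expanded
        dfa[current] = {}
        for sym in alphabet:
            moved = set()
            for s in current:
                moved.update(nfa.get(s, {}).get(sym, ()))
            target = epsilon_closure(nfa, moved)
            dfa[current][sym] = target
            if target not in dfa:
                queue.append(target)
    return dfa


def kth_from_end_nfa(k):
    """NFA with k + 1 states: the k-th symbol from the end is 'a'."""
    nfa = {0: {"a": {0, 1}, "b": {0}}}
    for i in range(1, k):
        nfa[i] = {"a": {i + 1}, "b": {i + 1}}
    nfa[k] = {}  # accepting state, no outgoing transitions
    return nfa


if __name__ == "__main__":
    for k in (2, 4, 8):
        dfa = subset_construction(kth_from_end_nfa(k), 0, "ab")
        print(k + 1, "NFA states ->", len(dfa), "DFA states")
```

In this family the reachable DFA states track which of the last k symbols were `a`, so their number doubles with each increment of k while the NFA grows by a single state.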