
Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

Closed
dweiss opened this issue Jan 27, 2020 · 12 comments
Assignees
Labels
enhancement (Feature requests), question (More information is needed)

Comments


dweiss commented Jan 27, 2020

Apache Lucene has a JFlex definition file (UAX29URLEmailTokenizerImpl.jflex) that is 21 KB and includes some other JFlex files that add ~40 KB. The generated file (DFA) is 2.2 MB, so it's still relatively small. The generation process takes a whopping 10 minutes and requires 10 GB of RAM though – most of the time is spent constructing the NFA and computing epsilon closures.

I thought it'd be interesting to look into why this behaves this way. Perhaps it'd provide some clues for optimization.

You can reproduce the generation from the Lucene repository (https://github.com/apache/lucene-solr/):

java -jar c:\_tmp\jflex-1.7.0\lib\jflex-full-1.7.0.jar --verbose --encoding "UTF-8" lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex --skel lucene/core/src/data/jflex/skeleton.disable.buffer.expansion.txt

dweiss commented Jan 27, 2020

Seems like most of the problem is caused by this:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro

I don't know if there is any other replacement that wouldn't explode the NFA internally...


dweiss commented Jan 30, 2020

Right... I think we're experiencing this exactly.

  /**
   * Constructs an NFA accepting the complement of the language of a given NFA.
   *
   * <p>Converts the NFA into a DFA, then negates that DFA. Exponential state blowup possible and
   * common.
...
  private IntPair complement(IntPair nfa) {


lsf37 commented Feb 3, 2020

Sorry for the silence, was mostly offline last week. I didn't directly spot any negation operator (!) in ASCIITLD.jflex-macro, but I might have overlooked it. Or is it in a spec that includes this one?

Basically, the NFA->DFA transformation is always potentially exponential, and the negation operator includes such a transformation. There are cases that must be exponential, because even the minimal DFA contains exponentially more states than the NFA, and there are cases where it's "just" exponential in the construction. 2.2 MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10 GB it's tiny, so at least some part should be reducible.
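To see how quickly subset construction can blow up, here is a toy demonstration in plain Java. This is a hypothetical illustration, not JFlex internals: the textbook NFA for "the n-th symbol from the end is 'a'" over {a, b} has only n+1 states, yet subset construction reaches 2^n DFA states, and no smaller DFA exists for that language.

```java
import java.util.*;

public class Blowup {
    // Number of reachable DFA states produced by subset construction on
    // the (n+1)-state NFA for "the n-th symbol from the end is 'a'", n >= 1.
    static int dfaStates(int n) {
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = Set.of(0);
        seen.add(start);
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.pop();
            for (char c : new char[] {'a', 'b'}) {
                Set<Integer> next = new HashSet<>();
                for (int s : cur) {
                    if (s == 0) {                  // state 0 loops on a/b ...
                        next.add(0);
                        if (c == 'a') next.add(1); // ... and guesses the final 'a'
                    } else if (s < n) {
                        next.add(s + 1);           // states 1..n-1 count symbols
                    }                              // state n (accepting) has no successors
                }
                if (seen.add(next)) work.add(next);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++)
            System.out.println(n + " -> " + dfaStates(n) + " DFA states");
    }
}
```

Each reachable DFA state records which of the last n input symbols were 'a', so all 2^n combinations occur.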

One thing that is notable in ASCIITLD.jflex-macro is that it's a very large | of basically strings. The | construct builds an NFA for each alternative and then connects them all to an initial state by epsilon transitions. This would explain why you are seeing a lot of time spent in epsilon closures, and it also explains the state explosion, because the resulting DFA states basically track which set of these alternatives is still possible. The number of such sets is potentially roughly exponential in the number of alternatives and maybe their length, depending on how soon the alternatives diverge from each other.
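For readers following along, the epsilon-closure step itself is a simple worklist fixpoint. A minimal sketch, assuming a hypothetical adjacency-map NFA representation rather than JFlex's actual data structures:

```java
import java.util.*;

public class EpsilonClosure {
    // epsilon.get(s) = states reachable from s by a single epsilon transition.
    static Set<Integer> closure(Map<Integer, Set<Integer>> epsilon, Set<Integer> start) {
        Set<Integer> result = new HashSet<>(start);
        Deque<Integer> work = new ArrayDeque<>(start);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : epsilon.getOrDefault(s, Set.of())) {
                if (result.add(t)) {  // enqueue only newly discovered states
                    work.push(t);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A large '|' fans out from one initial state to every alternative
        // via epsilon transitions, so closures of early states get big.
        Map<Integer, Set<Integer>> eps = Map.of(
                0, Set.of(1, 2, 3),  // initial state -> start of each alternative
                3, Set.of(4));       // one alternative with a further epsilon step
        System.out.println(closure(eps, Set.of(0)));
    }
}
```

The algorithm is linear per closure, but with thousands of alternatives fanning out from one state, the closures themselves become very large sets that are recomputed and compared many times.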

One thing to try would be to factor out common prefixes, e.g.

| [aA][bB][aA][rR][tT][hH]
| [aA][bB][bB]
| [aA][bB][bB][oO][tT][tT]
| [aA][bB][bB][vV][iI][eE]

=

 [aA][bB] ([aA][rR][tT][hH] | [bB] ([oO][tT][tT] | [vV][iI][eE])?)

This should reduce the number of DFA states in initial construction, because the first character in the alternative already distinguishes which one it is. It basically does something that DFA minimisation would take care of later. The spec is big, of course, and manually transforming it will be error prone, so before even attempting it, it would be worth checking out if it helps enough.
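The factoring could in principle be automated rather than done by hand. A rough sketch, simplified to plain lowercase strings (the real macro uses [aA]-style character classes, which would need the same treatment per class rather than per character): build a trie over the alternatives and print it back as a nested expression, so each shared prefix appears once.

```java
import java.util.*;

public class PrefixFactor {
    // A trie node: children keyed by next character, plus an end-of-word marker.
    static class Node {
        Map<Character, Node> children = new TreeMap<>();  // TreeMap: stable output order
        boolean terminal;
    }

    static String factor(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray())
                n = n.children.computeIfAbsent(c, k -> new Node());
            n.terminal = true;
        }
        return emit(root);
    }

    // Print a node as a regex-like expression with common prefixes factored out.
    static String emit(Node n) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<Character, Node> e : n.children.entrySet())
            parts.add(e.getKey() + emit(e.getValue()));
        String alts;
        if (parts.isEmpty()) alts = "";
        else if (parts.size() == 1) alts = parts.get(0);
        else alts = "(" + String.join("|", parts) + ")";
        if (n.terminal && !n.children.isEmpty()) {
            // a word ends here but longer words continue: the suffix is optional
            if (parts.size() == 1 && parts.get(0).length() > 1)
                alts = "(" + alts + ")";
            alts = alts + "?";
        }
        return alts;
    }

    public static void main(String[] args) {
        // the four TLD alternatives from the example above
        System.out.println(factor(List.of("abarth", "abb", "abbott", "abbvie")));
    }
}
```

On the four alternatives above this produces `ab(arth|b(ott|vie)?)`, which matches exactly the same four strings with the prefixes shared.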

The other question is, if there is a negation operator on the big expression, whether that operator could somehow be avoided.


lsf37 commented Feb 3, 2020

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

@lsf37 lsf37 self-assigned this Feb 3, 2020
@lsf37 lsf37 added the question More information is needed label Feb 3, 2020

dweiss commented Feb 3, 2020 via email


lsf37 commented Feb 4, 2020

No worries. If the new feature helps, we can probably push out the JFlex 1.8.0 release next week, so you can have the build on an official version.


dweiss commented Feb 19, 2020

Hi Gerwin. Nah, I observe the same behavior on master, sadly. Compilation takes forever and requires huge amounts of RAM.

2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible.

Yes, that's my impression too. I bet something could be optimized along the way (maybe at the cost of algorithm clarity) to reduce the number of epsilon transitions before the conversion to a DFA takes place. I wish I had more time to help out!

Tweaking ASCIITLD.jflex-macro is certainly possible although it'll make the definition very complex and hard to understand. As it stands now you can tell what it's doing; I think the compiler should implement this state reduction internally - this shouldn't be too difficult even with a hash-based state deduplication? It's easy for me to say, of course. :)


lsf37 commented Feb 20, 2020

Alright, I'll look into it more deeply – surely there must be something we can do. It'll be a bit, though: next week is (hopefully) the 1.8.0 release first.

@lsf37 lsf37 added the enhancement Feature requests label Feb 20, 2020

dweiss commented Feb 20, 2020

No problem at all – this isn't a very pressing matter, as we regenerate those automata very infrequently. I was just surprised to see such a vast disproportion in memory consumption between generation and the final automaton.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container in which you can detect nodes with the same input/output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily, so it may not be a trivial change...
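To make the associative-container idea concrete, here is a minimal sketch (hypothetical state representation, not JFlex's bitset-based one): hash each state by its accepting flag plus its outgoing transitions, routed through the current representatives, and merge collisions until a fixpoint.

```java
import java.util.*;

public class Dedup {
    // trans.get(i) maps an input symbol to the target state of state i.
    // Returns rep[], where rep[i] is the surviving representative of state i.
    static int[] dedup(List<Map<Character, Integer>> trans, boolean[] accepting) {
        int n = trans.size();
        int[] rep = new int[n];
        for (int i = 0; i < n; i++) rep[i] = i;
        boolean changed = true;
        while (changed) {                      // iterate: merges enable more merges
            changed = false;
            Map<List<Object>, Integer> seen = new HashMap<>();
            for (int i = 0; i < n; i++) {
                if (rep[i] != i) continue;     // already merged away
                // signature: accepting flag + transitions via representatives
                Map<Character, Integer> sig = new TreeMap<>();
                for (Map.Entry<Character, Integer> e : trans.get(i).entrySet())
                    sig.put(e.getKey(), rep[e.getValue()]);
                List<Object> key = List.of(accepting[i], sig);
                Integer prev = seen.putIfAbsent(key, i);
                if (prev != null) { rep[i] = prev; changed = true; }
            }
        }
        return rep;
    }
}
```

This is essentially a cheap approximation of DFA minimisation: it only merges states that are locally indistinguishable, one hash pass per iteration; a production version would also path-compress the rep[] chains.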


lsf37 commented Feb 20, 2020

Ok, good to know that you're not stuck because of it.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container in which you can detect nodes with the same input/output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily, so it may not be a trivial change...

Yes, that might be a bit of a project. I think I'll try extracting common factors on the regexp AST first, and see if that has an impact.


rmuir commented Dec 4, 2021

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

@lsf37 I finally tried out the new feature (macros in char classes) and it's definitely a significant improvement: 285.26 seconds versus 918.87 seconds.

So the NFA complement that was happening due to the workaround was at least significantly amplifying the problem for us. Progress!

Thanks!


lsf37 commented Jan 6, 2023

Closing this, as the macro-in-char-classes feature seems to have alleviated the problem; opened #1026 as a tracking issue for the potential common-prefix optimisation mentioned in this issue.

@lsf37 lsf37 closed this as completed Jan 6, 2023