Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

addresses issue #188 #314

Merged
merged 4 commits into from
Jun 17, 2024
Merged

addresses issue #188 #314

merged 4 commits into from
Jun 17, 2024

Conversation

timbray
Copy link
Owner

@timbray timbray commented Jun 9, 2024

prevent state explosions with epsilon transitions

The introduction of proper NFA epsilon transitions - see https://swtch.com/~rsc/regexp/regexp1.html - totally fixed this problem. The most dramatic illustration is line 96 of shell_style_test.go. Previously, you could only use 20 or so shell_style patterns before an O(2**N) explosion leading to millions of states. Now you can add all 12K in the test file. The addPattern() runs almost instantly, although matching slows down, a fair trade-off.

Once this is landed, we can start implementing all the cool stuff that was-ruler has without worry about exploding NFAs.

prevent state explosions with epsilon transitions

Signed-off-by: Tim Bray <tbray@textuality.com>
@codecov-commenter
Copy link

codecov-commenter commented Jun 9, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 9 lines in your changes missing coverage. Please review.

Project coverage is 96.10%. Comparing base (5eee82d) to head (37b33ab).

Files Patch % Lines
prettyprinter.go 70.58% 4 Missing and 1 partial ⚠️
nfa.go 94.00% 2 Missing and 1 partial ⚠️
core_matcher.go 92.30% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
- Coverage   96.39%   96.10%   -0.30%     
==========================================
  Files          18       18              
  Lines        1718     1744      +26     
==========================================
+ Hits         1656     1676      +20     
- Misses         35       40       +5     
- Partials       27       28       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@timbray
Copy link
Owner Author

timbray commented Jun 9, 2024

Oh drat, I somehow missed go test -v -race -count=1 looks like a real race condition. Please ignore this PR for now.

Signed-off-by: Tim Bray <tbray@textuality.com>
@timbray
Copy link
Owner Author

timbray commented Jun 9, 2024

You may have noticed a bit of thrashing in this PR

There's an important trade-off to consider.

Performance

To summarize, #314 cleanly kills issue #188 which means that you can freely use "*" globs, as many as you want, and you won't get an O(2**N) explosion in your memory size and addPattern() latency. I'll also add the ability to have multiple wildcards in a single pattern and that will be fine too. Now we have a clear path to doing full regular expressions in Quamina.

However, there is a cost. #314 adds "epsilon transitions" and multi-state traversal to our automata, which solves a lot of problems but has a performance cost. The latest versions, compared to the current main branch on GitHub, is 20% slower on exact-match, 15% slower on prefix, and 10% slower on anything-but. (The most recent push is much worse but I've fixed that in my version, back to around 20%). I care about performance a lot, this is bad.

On the other hand:

  1. It's still a lot faster than aws/event-ruler
  2. It still does millions of matches/second
  3. The new epsilon-transition path hasn't received the kind of intense optimization that the existing one has.

Question for the audience: Should we go ahead and accept the performance penalty in exchange for opening the gates to lots of nice features? Is there anyone now using Quamina who's going to be troubled by a ~20% performance hit?

Signed-off-by: Tim Bray <tbray@textuality.com>
@jsmorph
Copy link
Collaborator

jsmorph commented Jun 10, 2024

That performance hit won't cause me any heartburn.

Does this work help at all with a future numeric implementation?

Signed-off-by: Tim Bray <tbray@textuality.com>
@timbray
Copy link
Owner Author

timbray commented Jun 16, 2024

Does this work help at all with a future numeric implementation?

I'm not sure. Because of the way smallTables work, I don't think there any need for nondeterministic automata to do that. Which in fact we managed to do with Ruler back in the day.

@timbray
Copy link
Owner Author

timbray commented Jun 16, 2024

Hearing no objections… Once again, I acknowledge that potential reviewers are probably not eager about digging through finite-automata theory, so unless someone screams STOP I'll push this Monday.

@timbray timbray merged commit 21b3414 into main Jun 17, 2024
7 checks passed
@timbray timbray deleted the issue-188 branch June 17, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants