Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRegex splitting #2389

Merged
merged 3 commits into from
Jun 26, 2021
Merged

TRegex splitting #2389

merged 3 commits into from
Jun 26, 2021

Conversation

nirvdrum
Copy link
Collaborator

This is a port of Truffle::RegexOperations.match_in_region_tregex from Ruby to Java in order to reduce the number of AST nodes needed, which in turn allows for splitting. Being able to split leads to substantial performance improvements. Using the microbenchmarks from #2381, I have:

Joni for all regexs:

Warming up --------------------------------------
literal characters - no match
                       146.557k i/100ms
literal characters - match
                       268.784k i/100ms
   universal - match   418.954k i/100ms
character class - match
                       416.244k i/100ms
character class - no match
                       454.312k i/100ms
   unique - no match   335.215k i/100ms
      unique - match   370.972k i/100ms
Calculating -------------------------------------
literal characters - no match
                          6.058M (±12.4%) i/s -     29.751M in   5.009168s
literal characters - match
                          4.819M (±10.1%) i/s -     23.922M in   5.032138s
   universal - match      5.047M (± 8.7%) i/s -     25.137M in   5.029133s
character class - match
                          4.790M (± 8.4%) i/s -     24.142M in   5.081252s
character class - no match
                          5.727M (± 8.0%) i/s -     28.622M in   5.036076s
   unique - no match      3.353M (± 6.6%) i/s -     16.761M in   5.021652s
      unique - match      4.631M (±10.6%) i/s -     23.000M in   5.038905s

Comparison:
literal characters - no match:  6058081.9 i/s
character class - no match:  5726703.3 i/s - same-ish: difference falls within error
   universal - match:  5047020.0 i/s - same-ish: difference falls within error
literal characters - match:  4819255.1 i/s - 1.26x  (± 0.00) slower
character class - match:  4789748.9 i/s - 1.26x  (± 0.00) slower
      unique - match:  4631319.4 i/s - 1.31x  (± 0.00) slower
   unique - no match:  3353318.8 i/s - 1.81x  (± 0.00) slower

Truffle Regex with Joni fallback:

Warming up --------------------------------------
literal characters - no match
                       227.018k i/100ms
literal characters - match
                       307.603k i/100ms
   universal - match   262.539k i/100ms
character class - match
                       517.341k i/100ms
character class - no match
                         1.233M i/100ms
   unique - no match     3.771M i/100ms
      unique - match   569.974k i/100ms
Calculating -------------------------------------
literal characters - no match
                         83.620M (± 6.9%) i/s -    414.081M in   4.980405s
literal characters - match
                         55.248M (±14.1%) i/s -    261.770M in   4.993725s
   universal - match      9.618M (±14.5%) i/s -     46.994M in   5.018544s
character class - match
                          8.689M (±17.9%) i/s -     41.905M in   5.033010s
character class - no match
                         74.890M (±20.0%) i/s -    355.090M in   5.017734s
   unique - no match     88.609M (±24.7%) i/s -    414.766M in   4.999963s
      unique - match      5.252M (±22.5%) i/s -     25.079M in   5.000489s

Comparison:
   unique - no match: 88608780.1 i/s
literal characters - no match: 83620213.8 i/s - same-ish: difference falls within error
character class - no match: 74890461.6 i/s - same-ish: difference falls within error
literal characters - match: 55248302.6 i/s - 1.60x  (± 0.00) slower
   universal - match:  9618264.2 i/s - 9.21x  (± 0.00) slower
character class - match:  8689317.1 i/s - 10.20x  (± 0.00) slower
      unique - match:  5251575.9 i/s - 16.87x  (± 0.00) slower

This was run on a MacBook Pro i7 2.6 GHz with macOS 11.3.1.

nirvdrum added 3 commits June 24, 2021 21:30
…one.

The Ruby implementation resulted in too many AST nodes, which was preventing the method from splitting, which had a large negative impact on overall regex performance.
The `raw_bytes` node was used to get a Java `byte[]` out of a Rope as a polyglot object for hooking into Truffle Regex. Since the Truffle Regex interop is now done in Java, there's no need for this node.
@nirvdrum
Copy link
Collaborator Author

One thing to note is there is some Java -> Ruby -> Java boxing of some data types. E.g., SelectEncodingNode returns a RubyRegexp, but in the end all we need is an Encoding. I left most of them figuring EA would take care of them. I did end up adding a new specialization to MatchDataCreateNode to work with an int[] instead of having to wrap that up into a RubyArray. I can back that change out and wrap it as a RubyArray if that's preferable to anyone.

Copy link
Member

@eregon eregon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is there is some Java -> Ruby -> Java boxing of some data types. E.g., SelectEncodingNode returns a RubyRegexp, but in the end all we need is an Encoding. I left most of them figuring EA would take care of them.

I think this is a good opportunity to simplify, and select_encoding is used nowhere else.

Copy link
Member

@eregon eregon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this looks great.
I'll address my own comments here, as a way to get familiar with the changed code.

@eregon eregon added the in-ci The PR is being tested in CI. Do not push new commits. label Jun 25, 2021
graalvmbot pushed a commit that referenced this pull request Jun 26, 2021
PullRequest: truffleruby/2767
@graalvmbot graalvmbot merged commit e11975b into oracle:master Jun 26, 2021
@eregon eregon added this to the 21.3.0 milestone Jun 28, 2021
@tomstuart tomstuart deleted the tregex-splitting branch June 28, 2021 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in-ci The PR is being tested in CI. Do not push new commits. oca-signed shopify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants