Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex part deux - INTERPOLATION_SYNTAX #669

Merged
merged 3 commits into from
Jun 21, 2023
Merged

Conversation

kbrock
Copy link
Contributor

@kbrock kbrock commented Jun 13, 2023

Thanks for the great gem.

I was curious what I could do with INTERPOLATION_SYNTAX.

This has 3 commits.

  • The commit from the regex Improve TOKENIZER by 23% #668
  • An update to INTERPOLATION_SYNTAX with minor ruby changes.
  • Dropping INTERPOLATION_SYNTAX and just using TOKENIZER.

I ran tests with ruby 2.6.9 and 3.0.6

Not sure when the syntax change for the substring was introduced str[1..].
rubocop suggested I change my str[1..-1] over to that.
They also said the backslash in [^\}] was not necessary.

Let me know if you would like to keep INTERPOLATION_SYNTAX and I can throw away the second commit. Or if you like it, I can squash the two.
Something was nice about the multiple capture groups in the regular expression, but I didn't feel the complexity (from pcre's perspective) bought too much. But since this is your project, it is your call.

Also in reference to #667

As I started to run numbers, I'm feeling less and less like this is a DoS. So maybe I'm not the right person to state an opinion on whether these changes are necessary.

From the commit messages

/(%)?(%\{([^\}]+)\})/ =~ '%{{'*9999)+'}'

/(%)?(%\{([^\}]+)\})/ ==> 199,984 steps
/(%%?)\{([^\}]+)\}/   ==> 129,989 steps

/(%%?\{[^\}]+\})/     ==>  99,992 steps

But that hasn't reached the TOKENIZER performance, so the second commit went with that one:

/(%%?\{[^\}]+\})/     ==>  99,992 steps

From what I can see, this is done in linear time: 4*O(n)
This tokenizer change converts that to something a little quicker: 3*O(n)

Seems that not using a capture group and something other than split would be a big win.
Other than that, the changes were meager.

I used https://regex101.com/ (and pcre2) to evaluate the cost of the TOKENIZER.
I verified with cruby 3.0.6

```
 /(%%\{[^\}]+\}|%\{[^\}]+\})/ =~ '%{{'*9999)+'}'

/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==> 129,990 steps
/(%?%\{[^\}]+\})/            ==> 129,990 steps
/(%%?\{[^\}]+\})/            ==>  99,992 steps (simple savings of 25%) <===
/(%%?\{[^%}{]+\})/           ==>  89,993 steps (limiting variable contents has minimal gains)
```

Also of note are the null/simple cases:

```
/x/ =~ '%{{'*9999)+'}'
/x/                          ==>  29,998 steps
/(x)/                        ==>  59,996 steps
/%{x/                        ==>  49,998 steps
/(%%?{x)/                    ==>  89,993 steps
```

And comparing against a the plain string of the same length.

```
/x/ =~ 'abb'*9999+'c'

/x/                          ==>  29,999
/(%%?{x)/                    ==>  59,998
/(%%?\{[^\}]+\})/            ==>  59,998
/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==>  89,997
```

per ruby-i18n#667
same as tokenizer change:
from regex101.com pcre2 debugger:

```ruby

/(%)?(%\{([^\}]+)\})/ =~ '%{{'*9999)+'}'

/(%)?(%\{([^\}]+)\})/ ==> 199,984 steps
/(%%?)\{([^\}]+)\}/   ==> 129,989 steps

/(%%?\{[^\}]+\})/     ==>  99,992 steps
```

So the extra capture group is the main hit.
@kbrock
Copy link
Contributor Author

kbrock commented Jun 13, 2023

come to think of it, may be able to skip using this regular expression at all. or if using it, skip on the capture group all together. But feeling this is way overkill, especially since we are in linear time.

@radar
Copy link
Collaborator

radar commented Jun 21, 2023

Not sure when the syntax change for the substring was introduced str[1..].

Ruby 2.6.

This is currently the earliest version of Ruby that i18n supports, so I think it is safe.

@radar
Copy link
Collaborator

radar commented Jun 21, 2023

I like it! Simpler regular expressions will always get my vote.

@radar radar merged commit 7cf0947 into ruby-i18n:master Jun 21, 2023
@kbrock kbrock deleted the regex2 branch July 13, 2023 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants