Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Regex::CompileOptions::MULTILINE_ONLY #14870

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

ralsina
Copy link

@ralsina ralsina commented Aug 6, 2024

Adds a new MULTILINE_ONLY flag to regexes.

I created an issue here:

#14869

As discussed here:

https://forum.crystal-lang.org/t/regex-that-is-multiline-but-not-dotall-how/7054/12

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: The translation needs to be implemented in pcre_compile_options as well.

Also would be nice to have some specs for testing the behaviour.

src/regex.cr Outdated Show resolved Hide resolved

# Multiline matching.
#
# Equivalent to `MULTILINE | DOTALL` in PCRE and PCRE2.
MULTILINE = 0x0000_0006
# Equivalent to `MULTILINE`in PCRE and PCRE2.
MULTILINE_ONLY = 0x0000_0040
Copy link
Member

@straight-shoota straight-shoota Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Theoretically, we're free to chose whatever values we want for this enum because they're specific to Crystal and get translated for any engine implementation. But since all other members are based on the libpcre1 values, we should follow this scheme:

Suggested change
MULTILINE_ONLY = 0x0000_0040
MULTILINE_ONLY = 0x0000_0002

Also this should ensure that MULTILINE_ONLY | DOTALL == MULTILINE which is a useful property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@straight-shoota - Note that DOTALL 2 lines below is already 0x0000_0002, so if we change this value, do we have to change the DOTALL value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this should ensure that MULTILINE_ONLY | DOTALL == MULTILINE which is a useful property.

It's probably worth adding that as an explicit test and/or encoding that directly in the definition of MULTILINE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh the DOTALL = 2 is surprising. Guess that never worked correctly when it was mapped directly to libpcre1 values 🙈

I suppose it's probably best to add MULTILINE_ONLY = 4 then. This breaks value equivalence with libpcre1, but keeps the Crystal values consistent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using the MULTILINE without DOT_ALL value at first but it also breaks inspect (it gets confused and thinks MULTILINE_ONLY is MULTILINE)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work (somehow). If it doesn't, there's something wrong. Worst case we override the implementation of CompileOptions#inspect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, changing the values could break existing code, even if I doubt anyone is instantiating an Options[4] or whatever :-D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's unlikely. But better to keep this compatibility in case someone relies on the integer values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this to 0x4 then. It'll be the MULTILINE and DOTALL values of PCRE1 swapped but still ensure MULTILINE = MULTILINE_ONLY | DOTALL.

Suggested change
MULTILINE_ONLY = 0x0000_0040
MULTILINE_ONLY = 0x0000_0004

src/regex.cr Outdated Show resolved Hide resolved
src/regex/pcre2.cr Outdated Show resolved Hide resolved
src/regex.cr Show resolved Hide resolved
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
@straight-shoota straight-shoota changed the title Add MULTILINE_ONLY flag for regex Add Regex::CompileOptions::MULTILINE_ONLY Aug 6, 2024
@ralsina
Copy link
Author

ralsina commented Aug 8, 2024

Re-added the special handling for MULTILINE (setting MULTILINE and DOTALL) because removing it broke a test.

The remaining test failure happens on main too (at spec/std/regex_spec.cr:19)

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case for MULTILINE is only necessary because the value of MULTILINE_ONLY is not a composite of MULTILINE. Change it to 0x0000_0004 and it should work.

And the test failure is in fact introduced by this change because the test uses 0x0000_0040 as an undefined option value (but is now defined).


# Multiline matching.
#
# Equivalent to `MULTILINE | DOTALL` in PCRE and PCRE2.
MULTILINE = 0x0000_0006
# Equivalent to `MULTILINE`in PCRE and PCRE2.
MULTILINE_ONLY = 0x0000_0040
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this to 0x4 then. It'll be the MULTILINE and DOTALL values of PCRE1 swapped but still ensure MULTILINE = MULTILINE_ONLY | DOTALL.

Suggested change
MULTILINE_ONLY = 0x0000_0040
MULTILINE_ONLY = 0x0000_0004

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

It's impossible to create a regex that is MULTILINE but not DOT_ALL
5 participants