-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RegExp - add case insensitive matching option #1541
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments but looks great @markharwood
lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExp.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the flag should be renamed UNICODE_CASE_INSENSITIVE
since it handles code points and not only US-ASCII charset like the java's Pattern flag.
The flags that can be passed to the constructor are about the supported operators. Case insentivity is not an operator so it feels wrong to ask users to configure it this way too? I think it's fine to record this information in the flags internally, but maybe we should make the constructor take an additional boolean instead of expecting users to configure case insensitivity via a syntax flag? |
Does it change things if we consider Java's case insensitivity is also a bit mask flag passed to the constructor? |
On reflection, you're right - the single flag is trappy.
This should solve the problem of propagating settings and give us a sounder footing to build on. One issue is that this would technically be a breaking change as https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-9371 opened up the internal state of the parser and we will change the class of nodes. |
@markharwood Java's case-insensitivity flag is indeed a bit mask flag passed to the constructor, but my comment was more about the fact that the current flags on RegExp are about what operators are supported, while the bit you are adding controls how matching works. So in my opinion, it should be a different constructor argument (boolean or other bit set depending on whether we're seeing other desirable ways to control how matching works). We could merge both bitsets internally if that makes things easier, my concern is only about the API. I don't have an opinion about the RegExp split, but I don't feel bad about propagating information recursively like in your PR. |
I updated the constructor on this @jpountz - good to go? |
7bc5c1d
to
def0f8e
Compare
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
Outdated
Show resolved
Hide resolved
990c3d0
to
750d612
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one minor comment, LGTM otherwise
/** | ||
* Allows case insensitive matching of ASCII characters. | ||
*/ | ||
public static final int ASCII_CASE_INSENSITIVE = 0x0100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call it CASE_INSENSITIVE since we want to leave the door for another flag that would control if unicode should be handled fully ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it might be useful if the flag name reflected the current limitations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then what would be the other flag to allow unicode support ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNICODE_CASE_INSENSITIVE or just CASE_INSENSITIVE?
Either way it would cover ASCII and all other UNICODE characters.
Admittedly slightly odd that the two flags overlap but the alternative is people may assume that a non-qualified name like "CASE_INSENSITIVE" would cover all the bases when we only currently support ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we ensure that the flag is not used in the syntax_flags since we merge the two internally ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was this class was lenient to syntax flags > 0xff before this change so should remain so for BWC reasons
*/ | ||
public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException { | ||
// (for BWC reasons we don't validate invalid bits, just trim instead) | ||
syntax_flags = syntax_flags & 0xff; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to maintain bw compat for this, is there any test that fails if you remove this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, no. The change would be to remove that line and replace with
if (syntax_flags > ALL) {
throw new IllegalArgumentException("Illegal syntax flag");
}
…ructors and for loop tweak
…e boolean to allow for future developments.
Reordered constructor args and added arg validation.
Relates to Jira issue 9386
Added a new CASE_INSENSITIVE option to the existing flags.
The RegExp class is a little strange because instances represent either the parser or the parsed objects it nests in a tree. The
flags
field is only relevant to the root parser and was left blank in all parsed nodes. This PR's changes require that the flags int is propagated to all nodes so that they can see if it includes the case insensitive option (all other bits in the flag represent parsing options so there was no need to propagate before).