RegExp - add case insensitive matching option #1541

markharwood · 2020-05-28T13:40:59Z

Relates to Jira issue 9386

Added a new CASE_INSENSITIVE option to the existing flags.
The RegExp class is a little strange because instances represent either the parser or the parsed objects it nests in a tree. The flags field is only relevant to the root parser and was left blank in all parsed nodes. This PR's changes require that the flags int is propagated to all nodes so that they can see if it includes the case insensitive option (all other bits in the flag represent parsing options so there was no need to propagate before).
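To illustrate the propagation described above, here is a minimal sketch. It is not Lucene's RegExp: the class name `FlaggedNodeSketch`, the literal-only "parser", and the bit value are all assumptions made for illustration. It shows one class acting as both parser entry point and parsed node, with the flags copied into every node so each can check the case-insensitive bit locally.

```java
// Hypothetical sketch, not Lucene's RegExp: a single class acts as both the
// parser entry point and the parsed node, so the flags have to be copied
// into every node the parser creates.
final class FlaggedNodeSketch {
  // Assumed bit value, chosen only for this illustration.
  static final int CASE_INSENSITIVE = 0x0100;

  private final int flags;              // same value in every node of one parse
  private final char literal;           // the character this node matches
  private final FlaggedNodeSketch next; // following node, or null at the end

  private FlaggedNodeSketch(int flags, char literal, FlaggedNodeSketch next) {
    this.flags = flags;
    this.literal = literal;
    this.next = next;
  }

  /** Parses a non-empty, literal-only pattern, handing the flags to each node. */
  static FlaggedNodeSketch parse(String pattern, int flags) {
    if (pattern.isEmpty()) {
      throw new IllegalArgumentException("pattern must not be empty");
    }
    FlaggedNodeSketch node = null;
    for (int i = pattern.length() - 1; i >= 0; i--) {
      node = new FlaggedNodeSketch(flags, pattern.charAt(i), node);
    }
    return node;
  }

  /** Walks the chain; each node decides locally how to compare characters. */
  boolean matches(String input) {
    FlaggedNodeSketch node = this;
    for (int i = 0; i < input.length(); i++) {
      if (node == null || !node.matchesChar(input.charAt(i))) {
        return false;
      }
      node = node.next;
    }
    return node == null;
  }

  private boolean matchesChar(char c) {
    if ((flags & CASE_INSENSITIVE) != 0) {
      // Stand-in comparison; the real PR builds automata instead.
      return Character.toLowerCase(c) == Character.toLowerCase(literal);
    }
    return c == literal;
  }
}
```

If the flags were kept only on the root, `matchesChar` on an inner node would have no way to see the option, which is the problem the PR description points at.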

jimczi

I left some minor comments but looks great @markharwood

lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExp.java

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

jimczi

I wonder if the flag should be renamed UNICODE_CASE_INSENSITIVE, since it handles code points, not just the US-ASCII charset that Java's Pattern flag covers.

jpountz · 2020-05-28T21:24:16Z

The flags that can be passed to the constructor are about the supported operators. Case insensitivity is not an operator, so it feels wrong to ask users to configure it this way too? I think it's fine to record this information in the flags internally, but maybe we should make the constructor take an additional boolean instead of expecting users to configure case insensitivity via a syntax flag?

markharwood · 2020-05-29T11:06:28Z

maybe we should make the constructor take an additional boolean instead of expecting users to configure case insensitivity via a syntax flag?

Does it change things if we consider Java's case insensitivity is also a bit mask flag passed to the constructor?

markharwood · 2020-06-01T08:51:47Z

On reflection, you're right - the single flag is trappy.
I'd like to refactor this class to make this simpler. The root problem we have is propagating parser state (flags/options) down to the objects that represent clauses in the parse tree. This is made difficult by the fact that RegExp is a single class representing both the parser and the parsed nodes.
I suggest refactoring so that:

- RegExp remains the user-facing class with the public constructor and has the parsing logic.
- We use a new private class RegExpClause to hold clause state; being an inner class, it has access to the flags in the outer RegExp instance that contains it.

This should solve the problem of propagating settings and give us a sounder footing to build on.
Should I do this refactor as part of this PR or another @jpountz ?

One issue is that this would technically be a breaking change as https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-9371 opened up the internal state of the parser and we will change the class of nodes.
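The proposed split can be sketched as follows. This is only an illustration of the inner-class idea, not the refactoring itself: `RegExpSketch` is a hypothetical stand-in for RegExp, the "pattern" is a plain literal, the bit value is assumed, and only the name `RegExpClause` comes from the comment above.

```java
// Hypothetical sketch of the proposed split: the outer class owns the flags,
// and clause objects read them through the enclosing instance.
final class RegExpSketch {
  // Assumed bit value, for illustration only.
  static final int CASE_INSENSITIVE = 0x0100;

  private final int flags;
  private final RegExpClause root;

  RegExpSketch(String literal, int flags) {
    this.flags = flags;
    // A real parser would build a tree of clauses here; one clause is enough
    // to show how the flags are reached.
    this.root = new RegExpClause(literal);
  }

  boolean matches(String input) {
    return root.matches(input);
  }

  // Non-static inner class: every instance keeps an implicit reference to the
  // RegExpSketch that created it, so no flags field is needed on the clause.
  private final class RegExpClause {
    private final String literal;

    RegExpClause(String literal) {
      this.literal = literal;
    }

    boolean matches(String input) {
      // "flags" resolves to RegExpSketch.this.flags.
      boolean caseInsensitive = (flags & CASE_INSENSITIVE) != 0;
      // equalsIgnoreCase is a stand-in for whatever matching the clause does.
      return caseInsensitive ? literal.equalsIgnoreCase(input) : literal.equals(input);
    }
  }
}
```

The trade-off is the one raised in the comment: clause objects no longer share the public class of the parser, which changes the type of nodes exposed to callers.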

jpountz · 2020-06-02T12:18:27Z

@markharwood Java's case-insensitivity flag is indeed a bit mask flag passed to the constructor, but my comment was more about the fact that the current flags on RegExp are about what operators are supported, while the bit you are adding controls how matching works. So in my opinion, it should be a different constructor argument (boolean or other bit set depending on whether we're seeing other desirable ways to control how matching works). We could merge both bitsets internally if that makes things easier, my concern is only about the API.

I don't have an opinion about the RegExp split, but I don't feel bad about propagating information recursively like in your PR.
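A rough sketch of the API shape suggested here, under stated assumptions: the class name `MergedFlagsSketch`, the constant names, and the bit values are hypothetical and chosen for illustration. Callers pass syntax flags and match flags as separate arguments, and the two bit sets are merged into one int internally.

```java
// Hypothetical sketch: two separate constructor arguments, one merged field.
final class MergedFlagsSketch {
  // Assumed layout: operator (syntax) bits live in the low byte,
  // matching options live above it.
  static final int INTERSECTION = 0x0001;     // example syntax bit (assumed)
  static final int SYNTAX_MASK = 0x00ff;      // assumed mask for syntax bits
  static final int CASE_INSENSITIVE = 0x0100; // example match bit (assumed)

  private final int flags;

  MergedFlagsSketch(int syntaxFlags, int matchFlags) {
    // The API keeps the two concerns apart; internally one int is enough
    // as long as the bit ranges do not overlap.
    this.flags = syntaxFlags | matchFlags;
  }

  /** True if the given operator bit was enabled through the syntax flags. */
  boolean supportsOperator(int operatorBit) {
    return (flags & operatorBit & SYNTAX_MASK) != 0;
  }

  /** True if case-insensitive matching was requested through the match flags. */
  boolean isCaseInsensitive() {
    return (flags & CASE_INSENSITIVE) != 0;
  }
}
```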

markharwood · 2020-06-04T10:56:24Z

I updated the constructor on this @jpountz - good to go?

lucene/core/src/java/org/apache/lucene/search/RegexpQuery.java

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

markharwood · 2020-06-25T09:16:52Z

Thanks for the reviews @jpountz and @jimczi
I think I've addressed all the review comments now if you have a chance to take another look

jimczi

I left one minor comment, LGTM otherwise

jimczi · 2020-06-25T10:59:53Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

+  /**
+   * Allows case insensitive matching of ASCII characters.
+   */
+  public static final int ASCII_CASE_INSENSITIVE = 0x0100;    


Let's call it CASE_INSENSITIVE, since we want to leave the door open for another flag that would control whether Unicode should be handled fully?

markharwood · 2020-06-25

I thought it might be useful if the flag name reflected the current limitations?

jimczi · 2020-06-25

But then what would the other flag be, the one that allows Unicode support?

markharwood · 2020-06-25

UNICODE_CASE_INSENSITIVE or just CASE_INSENSITIVE?
Either way it would cover ASCII and all other Unicode characters.
Admittedly it is slightly odd that the two flags overlap, but the alternative is that people may assume a non-qualified name like "CASE_INSENSITIVE" covers all the bases when we currently only support ASCII.

jimczi · 2020-06-25

OK, fine by me.
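The ASCII-only limitation that motivates the flag name in this thread can be illustrated with a small sketch. It does not reproduce the PR's automaton code; `AsciiCaseSketch` and its methods are hypothetical helpers that only show what "case insensitive for ASCII characters" means at the code point level.

```java
// Hypothetical sketch of ASCII-only case folding on code points.
final class AsciiCaseSketch {
  /** Returns the other-case form of an ASCII letter, or the input unchanged. */
  static int altCase(int codePoint) {
    if (codePoint >= 'a' && codePoint <= 'z') {
      return codePoint - 32; // 'a'..'z' -> 'A'..'Z'
    }
    if (codePoint >= 'A' && codePoint <= 'Z') {
      return codePoint + 32; // 'A'..'Z' -> 'a'..'z'
    }
    return codePoint; // non-ASCII letters, digits, etc. have no alternative here
  }

  /** Matches if the input equals the pattern code point or its ASCII alt case. */
  static boolean matches(int patternCodePoint, int inputCodePoint) {
    return patternCodePoint == inputCodePoint
        || altCase(patternCodePoint) == inputCodePoint;
  }
}
```

Under this scheme `a` matches `A`, but a non-ASCII pair such as U+00E9 and U+00C9 does not match, which is the gap a later Unicode-aware flag would have to close.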

jimczi · 2020-06-25

Should we ensure that the flag is not used in syntax_flags, since we merge the two internally?

markharwood · 2020-06-25

My assumption was that this class was lenient to syntax flags > 0xff before this change, so it should remain so for BWC reasons.

markharwood · 2020-07-06T12:51:29Z

OK to merge this @jpountz ?
I had one last comment re flag validation

jpountz · 2020-07-07T15:04:48Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

+   */
+  public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException {    
+    // (for BWC reasons we don't validate invalid bits, just trim instead)
+    syntax_flags  = syntax_flags & 0xff;


I don't think we need to maintain bw compat for this, is there any test that fails if you remove this line?

markharwood · 2020-07-08

As far as I can see, no. The change would be to remove that line and replace it with:

if (syntax_flags > ALL) { throw new IllegalArgumentException("Illegal syntax flag"); }
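The two behaviours being compared, lenient trimming versus strict validation, can be sketched side by side. `FlagValidationSketch` is a hypothetical helper, and the value of `ALL` is assumed to be 0xff, matching the mask in the diff above.

```java
// Hypothetical sketch contrasting the two options discussed in this thread.
final class FlagValidationSketch {
  static final int ALL = 0xff; // assumed to equal the mask used in the diff

  /** Lenient variant from the diff: silently drops bits above the low byte. */
  static int trimSyntaxFlags(int syntaxFlags) {
    return syntaxFlags & 0xff;
  }

  /** Strict variant from the reply: rejects values larger than ALL. */
  static int checkSyntaxFlags(int syntaxFlags) {
    // Note: this is a signed comparison, so negative values are not rejected.
    // A mask test such as (syntaxFlags & ~ALL) != 0 would be stricter.
    if (syntaxFlags > ALL) {
      throw new IllegalArgumentException("Illegal syntax flag");
    }
    return syntaxFlags;
  }
}
```

With the strict variant, a caller who accidentally passes a match flag in the syntax argument gets an exception instead of having the bit silently discarded.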

lucene/core/src/java/org/apache/lucene/search/RegexpQuery.java

markharwood self-assigned this May 28, 2020

markharwood added the enhancement label May 28, 2020

markharwood requested a review from jimczi May 28, 2020 13:41

jimczi reviewed May 28, 2020

markharwood mentioned this pull request May 29, 2020

Support case insensitive search on new wildcard field and keyword elastic/elasticsearch#53603

Closed

markharwood force-pushed the fix/9386 branch 3 times, most recently from 7bc5c1d to def0f8e June 16, 2020 13:01

jpountz requested changes Jun 16, 2020

markharwood force-pushed the fix/9386 branch 2 times, most recently from 990c3d0 to 750d612 June 24, 2020 16:30

jimczi approved these changes Jun 25, 2020

markharwood force-pushed the fix/9386 branch from 750d612 to ef95884 July 6, 2020 10:48

jpountz reviewed Jul 7, 2020

jpountz approved these changes Jul 8, 2020

markharwood added 7 commits July 8, 2020 15:45

Added case insensitive search option

5bcaac7

Addressing review comments

07f9403

Changed case sensitive flag to a boolean in constructors.

16c198a

Reduced visibility of case insensitive flag as a non-user-facing flag

fce454a

Expose case sensitivity options in RegExpQuery

0857aad

Revert irrelevant javadoc change

678a572

Addressed review comments - ASCII-only case changes, reduce num constructors and for loop tweak

bb1bf20

markharwood added 3 commits July 8, 2020 15:45

Changed case sensitivity options to be a bit mask rather than a simple boolean to allow for future developments.

fff3712

Remove superfluous constructor

52788bc

Addressed review comments (thanks @jpountz !)

e4424fe

Reordered constructor args and added arg validation.

markharwood force-pushed the fix/9386 branch from e2fae15 to e4424fe July 8, 2020 14:46

markharwood merged commit 887fe4c into apache:master Jul 8, 2020

markharwood mentioned this pull request Aug 14, 2020

Option for case insensitive search at runtime elastic/elasticsearch#61162

Closed

7 tasks

cbuescher mentioned this pull request Apr 25, 2023

Support Case Insensitive search for foreign characters in wildcard field type elastic/elasticsearch#95120

Open
