set of strings: longest match #25

markusicu · 2021-05-26T23:15:00Z

In the TC39 meeting today (2021-05-26), there was some discussion of how to match character classes that contain multi-character strings, inspired by the slide that showed these examples:

[\p{RGI_Emoji}--(🇧🇪)]
[a-zA-Z(ch)(m̀)(か゚)(🇦🇺)(🇧🇪)(🇫🇷)] ≍ [a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]

The proposal is to match the longest strings first, so that a prefix string does not pre-empt matching a longer string. This needs to be done in the runtime semantics, after evaluating a set of strings (as a modified CharSet or as a StringSet, whichever way that goes).

In particular, we do not want to match strings in the order that they are written in the regular expression.
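As a rough illustration of the difference, the sketch below emulates a class containing strings by expanding it into an alternation sorted longest-first, and compares it with an alternation kept in source order. The helper names `escapeRegExp` and `longestFirstPattern` are made up for this example; this is not how the proposal specifies the semantics, only one way to picture the intended behavior.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a regex that matches any of the given strings,
// trying longer strings before shorter ones.
function longestFirstPattern(strings) {
  // Treat the input as a set: drop duplicates, ignore the original order.
  const unique = [...new Set(strings)];
  // Sort by length in code points, longest first, so that a prefix
  // such as "c" cannot pre-empt the longer string "ch".
  unique.sort((a, b) => [...b].length - [...a].length);
  return new RegExp('(?:' + unique.map(escapeRegExp).join('|') + ')', 'u');
}

// Longest-first: the whole "ch" is matched.
const longest = longestFirstPattern(['c', 'ch', 'h']);
console.log(longest.exec('chata')[0]); // "ch"

// Source order: the prefix "c" wins and "ch" is never tried.
const sourceOrder = /(?:c|ch|h)/u;
console.log(sourceOrder.exec('chata')[0]); // "c"
```

Backtracking can still fall back to a shorter string when the rest of the pattern requires it; the sort only fixes which alternative is tried first.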

Reasons:

A character class defines a set of characters/strings in the mathematical sense: no order, no duplicate elements.
The regex spec and its proposed changes are written in terms of set operations.
Source order would be confusing in a character class with unions, intersections, subtractions, and nested classes.
A Unicode property defines a set of characters/strings in the mathematical sense; in particular, no order. Thus, there is no order of the strings in [\p{RGI_Emoji}--(🇧🇪)] that we could preserve.
Implementation experience: ICU class UnicodeSet has supported string literals (though not properties of strings) since 2002. Unicode CLDR has used UnicodeSet syntax nearly that long (e.g., in exemplar character sets and transform/transliteration rules). There has been no discussion or confusion about these being sets in the mathematical sense.
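To make the set argument concrete, here is a small sketch that models classes as plain JavaScript `Set` objects. The helpers `union`, `difference`, and `sameSet` are hypothetical, and `rgiEmojiSample` is a three-element stand-in for the real `\p{RGI_Emoji}` property, which contains far more sequences.

```javascript
// Hypothetical stand-in for a property of strings such as \p{RGI_Emoji}:
// the flags for AU, BE, and FR, written as regional indicator pairs.
const rgiEmojiSample = new Set([
  '\u{1F1E6}\u{1F1FA}', // AU
  '\u{1F1E7}\u{1F1EA}', // BE
  '\u{1F1EB}\u{1F1F7}', // FR
]);

// Plain set operations on collections of strings.
function union(a, b) {
  return new Set([...a, ...b]);
}
function difference(a, b) {
  return new Set([...a].filter((x) => !b.has(x)));
}
// Set equality: same elements, regardless of order or duplicates.
function sameSet(a, b) {
  return a.size === b.size && [...a].every((x) => b.has(x));
}

// Two classes written in different orders, with a duplicate "a",
// denote the same set.
const first = union(new Set(['ch', 'a']), new Set(['a', 'b']));
const second = union(new Set(['b', 'a']), new Set(['a', 'ch']));
console.log(sameSet(first, second)); // true

// Subtracting one flag from the property leaves the other two;
// the result carries no source order that a matcher could follow.
const withoutBE = difference(rgiEmojiSample, new Set(['\u{1F1E7}\u{1F1EA}']));
console.log(withoutBE.size); // 2
```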

As for the longest match specifically, note that users may have no idea how many Unicode code points it takes to write a “character” like m̀, か゚, 👧🏿, or 🇧🇪; they just want it to “work”. (I even had a discussion this week with a Slovak colleague who expected there to exist a single-code-point way to write "ch".)
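For reference, a quick way to count what is behind each of these “characters” in JavaScript is sketched below. The strings are written with escapes so that the exact code points are visible; the helper name `codePointCount` is made up for this example.

```javascript
// Hypothetical helper: count Unicode code points (not UTF-16 code units).
function codePointCount(s) {
  return [...s].length;
}

const samples = [
  ['m + combining grave', 'm\u0300'],
  ['ka + combining handakuten', '\u304B\u309A'],
  ['girl + dark skin tone', '\u{1F467}\u{1F3FF}'],
  ['flag of Belgium', '\u{1F1E7}\u{1F1EA}'],
  ['Slovak ch', 'ch'],
];

for (const [name, s] of samples) {
  // Each sample is two code points; the emoji are four UTF-16 code units.
  console.log(name, codePointCount(s), s.length);
}
```

Every one of these is two code points, so a matcher that stops after the first code point would split each of them.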

mathiasbynens · 2021-05-27T09:50:24Z

cc @msaboff and @waldemarhorwat

mathiasbynens · 2021-06-25T12:53:48Z

As discussed in yesterday’s meeting, I’ve kicked off a PR to add this to the FAQ, pointing here for the more detailed rationale that @markusicu has posted. Once the PR lands, I’ll close this issue.

mathiasbynens · 2021-06-25T20:29:04Z

https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings

mathiasbynens added a commit that referenced this issue Jun 25, 2021

Add FAQ entry for match order

b1efecf

mathiasbynens mentioned this issue Jun 25, 2021

Add FAQ entry for match order #34

Merged

mathiasbynens added a commit that referenced this issue Jun 25, 2021

Add FAQ entry for match order (#34)

50e8216

mathiasbynens closed this as completed Jun 25, 2021
