Improve regexes #666

slevithan · 2024-06-20T00:41:41Z

I didn't audit all the regexes. Just the specific ones described below.

EMOJI_REGEX

Previous version matched lots of false positives like digits (0, etc.), *, bare U+200D (ZWJ), and some symbols like 👁, ✈, 🏳, and ♂ even when they're not followed by U+FE0F (none of which should match). So I fixed it.
See some examples I put together at https://regex101.com/r/WKFYNE/1
One open question I have is whether all enclosing marks are allowed where I used \p{Me}, or if it's a more limited set like just U+20E3. The segment in question (\p{Emoji}\uFE0F\p{Me}?) is used to match emoji like 2️⃣ which is made up of 3 code points: U+32 U+FE0F U+20E3.

HEXADECIMAL_REGEX

Replaced 0h|0x with 0[hx].

IPV4_REGEX

Replaced each byte segment (?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5]) with (?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d). IMO it's easier to read without the nested grouping, and it's the same length. Then replaced \d\d with \d{2} to work around an eslint error that I don't agree with.
Used interpolation to not repeat the byte subpattern.

IPV6_REGEX

No changes except replacing the two IPv4 subpatterns with interpolation. There's probably still room for improvement, at least if you used lookahead to help remove some of the redundancy.

IP_REGEX

654 characters of a regex literal to 41 chars using a dynamic regex with simple composition.

fabian-hiller · 2024-06-20T16:01:41Z

Thank you!

One open question I have is whether all enclosing marks are allowed where...

What do you recommend?

Used interpolation to not repeat the byte subpattern.

I had implemented this the same way before, but some bundlers had trouble tree-shaking it and including the regex in every bundle, whether used or not. So I repeated it. Due to compression it is probably also better in terms of bundle size.

slevithan · 2024-06-20T16:38:04Z

One open question I have is whether all enclosing marks are allowed where...

What do you recommend?

I recommend using it as written, with \p{Me} instead of \u20E3. It won't create problems and might allow for edge-case emoji I can't think of or future emoji. But I will follow up with additional investigation.

Used interpolation to not repeat the byte subpattern.

I had implemented this the same way before, but some bundlers had trouble tree-shaking it and including the regex in every bundle, whether used or not. So I repeated it. Due to compression it is probably also better in terms of bundle size.

Fair enough, if bundle size trumps readability. Another solution would be to use the regex package since its subroutines syntax allows subpattern composition without interpolation. But I'm guessing you won't think the dependency (despite being lightweight) is the right tradeoff since you currently advertise as having no dependencies.

fabian-hiller · 2024-06-20T18:53:29Z

Thanks you for your feedback! It may take some time to finish this PR as other things have a higher priority at the moment.

slevithan · 2024-06-28T16:48:36Z

I've done a lot more research on the emoji regex and published an improved/fixed version as emoji-regex-xs. I'll update the PR here shortly, and also remove all the use of composition in the regexes since it sounds like you'd prefer to avoid that.

fabian-hiller · 2024-06-28T17:40:35Z

Thank you very much! I will probably review and merge this PR next week.

slevithan · 2024-07-02T13:54:09Z

OK, updated. Passes tests and lint.

slevithan · 2024-07-03T08:45:59Z

In the last-added commit, I updated the comment about RGI_Emoji, pointing out that it's not a better replacement, even in the future when targeting Node.js 20+. The reasons are discussed in the emoji-regex-xs readme, but I'll also describe them here.

To clarify what the emoji regex in this PR matches:

It matches everything matched by ES2024's /\p{RGI_Emoji}/v.
It excludes non-emoji symbols, dingbats, "emoji components" (like digits, *, #, and ZWJ), etc.
- This is where the emoji regex prior to this PR fails.
It intentionally matches additional non-RGI emoji.

On that last point, unfortunately, some common-sense and broadly-supported emoji are not officially in the "RGI" list. Even the Unicode org provides emoji-test.txt that mixes in non-RGI emoji strings to help identify real-world emoji. And some emoji are commonly used in an underqualified or overqualified way (by including or excluding certain invisible Unicode markers) that prevents them from being matched by RGI_Emoji. For example, the iOS emoji keyboard overqualifies certain emoji. So we need something that matches everything in RGI_Emoji, and more.

The regex here allows overqualified and underqualified emoji using a general pattern that matches all Unicode sequences that follow the structure of valid emoji.

fabian-hiller · 2024-07-05T12:07:00Z

Thank you for your research! Is the new emoji regex more strict or accurate than the old one?

slevithan · 2024-07-06T08:45:26Z

Yes. If by "the old one" you meant my own iterations, this is the final version based on research in depth, and I've now therefore also published it as its own library (emoji-regex-xs). If there are changes to emoji-regex-xs in the future (e.g., if new versions of Unicode modify the general patterns for emoji), it can easily be updated by anyone, by simply copying the pattern from future versions of emoji-regex-xs and wrapping it in ^(?:…)+$. The wrapping pattern is because Valibot's emoji regex matches one or more emoji, as the whole string.

On emoji-regex-xs

This library shares the API and 3,000+ tests with emoji-regex. emoji-regex is very large (13 KB uncompressed), but it is authoritative (its author helped add things like \p{}, RGI_Emoji, and /v to JS regexes) and extremely broadly used. emoji-regex now also recommends and crosslinks emoji-regex-xs in its readme, for people who want a general pattern (not tied to a specific version of Unicode) that can therefore be much more lightweight.

Problems with Valibot's current emoji regex

If by "the old one" you meant the regex currently used in the code that this PR replaces, then yes, this PR is both more strict and more accurate.

Here is the current regex that is being replaced:

/^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u

This is extremely wrong. I'm assuming you picked it up from Zod, which uses the same thing, and I can find other people posting it online, which is where Zod probably got it from. It presumably has spread virally because:

It indeed matches any sequence of one or more emoji.
Emoji are complicated and most people aren't aware of what specific Unicode properties like \p{Emoji_Component} match, or the full variation of different Unicode pattern sequences needed to match the full set of emoji, much less to also deal with real-world data (where some additional variation is added).
Until now (with emoji-regex-xs, which came about based on my work on this PR 😊), there simply has not been a lightweight JS regex pattern I've seen anywhere online that approaches emoji-regex-xs's accuracy and robustness. So, many people settled for (or rolled their own) incomplete/broken patterns, or they used the excellent but very large emoji-regex that hardcodes thousands of code points. Or, in the last year, they started using ES2024's \p{RGI_Emoji}, but that has some problems as well which is why both emoji-regex and emoji-regex-xs match supersets of what it matches.

There are two big problems with the regex that this PR replaces:

It matches many things that are not emoji.
It cannot be used (or modified) to match individual emoji, because its reliance on \p{Emoji_Component} means it cannot possibly identify where where one emoji ends and the next starts. It only works at all because Valibot's requirement for this regex is to match one or more back-to-back emoji.
- I find this one-or-more emoji validation strange in the first place, since none of the other validation regexes in regex.ts match one or more of their target. I suspect (without evidence) that this anomaly started because this pattern simply can't be modified to accurately match a single, complete emoji, and the authors of Zod couldn't find a reliable yet short alternative for doing so.

Regarding the first problem, I already mentioned some of its false positives in earlier comments, but here are some additional details (not comprehensive):

It incorrectly matches * (U+002A) and # (U+0023).
It incorrectly matches ASCII digits (0, 1, 234, etc.).
It incorrectly matches a large number of symbols and dingbats that are not emoji. An example is ✈ (U+2708).
- Some of these also have versions that are emoji and should be matched. Ex: ✈️ (U+2708 U+FE0F).
It incorrectly matches a variety of Unicode format characters, enclosing marks, nonspacing marks, and other modifiers, some of which have purposes beyond emoji, and all of which are certainly not emoji on their own. Examples include U+200D, U+20E3, U+FE0F, and U+E007F.
It incorrectly matches all regional indicator symbol letters A-Z, tag digits 0-9, and tag Latin small letters a-z (collectively matched by /[\u{1F1E6}-\u{1F1FF}\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}]/u). These can be composed into emoji flags and (with additional markers) subdivision flags, but they are not emoji on their own.

The emoji regex in this PR fixes all of these issues.

fabian-hiller · 2024-07-06T18:30:24Z

Thanks again for your research and detailed answer! I thought it would be the best DX if emoji could be combined with minLength, maxLength and length to control the number of emoji. Unfortunately, this does not work because emoji can have different lengths. An alternative would be to control this via the arguments of emoji and dynamically bake it into the regex. What do you think is the best solution?

slevithan · 2024-07-06T21:22:09Z

I'm not familiar with Valibot's APIs. Could minLength, maxLength, and length not be sensibly used to refer to the number of discrete emoji to match, rather than the number of code points (returned by [...str].length) or code units (str.length) in the matched string? If they could, then both of the APIs you described sound reasonable to me, but I might be more opinionated if you showed code examples of how the two approaches would be used.

Certainly, with the new emoji regex, something like this could easily be done. You'd just need to change the + quantifier in the ^(?:…)+$ wrapper to whatever you want (e.g., {1}, {5}, {2,17}, or {1,}). It cannot be done with the current (pre-PR) emoji regex, for reasons explained in my last comment.

But this seems like something for a follow-up issue. I'd prefer to land this PR as is and for new functionality to be added afterward.

PS: The labels for this PR should include bug.

fabian-hiller · 2024-07-07T13:29:37Z

But this seems like something for a follow-up issue. I'd prefer to land this PR as is and for new functionality to be added afterward.

I agree.

[...] be sensibly used to refer to the number of discrete emoji to match

Any ideas on how to implement this? Maybe we could add something like a minChar, maxChar and char action.

slevithan · 2024-07-07T16:05:33Z

Okay, I looked at valibot.dev/api/string/ and valibot.dev/api/emoji/ to better understand what you're referring to.

I agree that minLength, maxLength, and length must continue to refer to JavaScript's UTF-16 code units.

The term character is very overloaded so I'd advise against using char, etc. In Unicode terms, a user-perceived character (such as an individual, complete emoji) is a grapheme cluster.

A concrete example is the emoji '👩🏻‍🏫'.
Unicode Name: Woman Teacher: Light Skin Tone.

// Code unit length
'👩🏻‍🏫'.length;
// → 7
// Each astral code point (above U+FFFF) is divided into high and low surrogates

// Code point length
[...'👩🏻‍🏫'].length;
// → 4
// These are: U+1F469 U+1F3FB U+200D U+1F3EB
// (U+1F469 U+1F3FB) is '👩🏻', U+200D is a Zero-Width Joiner, and U+1F3EB is '🏫'

// Grapheme cluster length
[...new Intl.Segmenter().segment('👩🏻‍🏫')].length;
// → 1

Intl.Segmenter is in Node.js 16+ (browser support table). The JS library orling/grapheme-splitter extends support backward.

Since Intl.Segmenter's documentation uses "grapheme" to refer to grapheme clusters, I think it's okay to use grapheme the same way (even though it actually segments grapheme clusters).

I think minGraphemeLength, maxGraphemeLength, and graphemeLength would be preferable to controlling this via the arguments of emoji, since counting grapheme clusters is broadly useful, beyond emoji. E.g. the Spanish letter ñ can be represented as either '\u00F1' ("ñ") or '\u006E\u0303' ("n" following by the nonspacing mark "Combining Tilde"). So either one or two code points, depending on representation, but always one grapheme cluster.

slevithan · 2024-07-10T17:23:49Z

Edited my last comment to use '👩🏻‍🏫' (a grapheme cluster with two graphemes) instead of '👩🏻‍👩🏻‍👦🏻‍👦🏻' (a grapheme cluster with four graphemes), since the latter currently only renders as a single user-perceived character on Microsoft and Facebook platforms. The emoji sets from Apple and Google don't include a unique design for it, so at least on iOS it renders as four discrete emoji characters, while the cluster is still selected as a single character.

fabian-hiller · 2024-07-14T15:53:48Z

Sorry for the late reply. I took a week off. Thanks for your research. I like the idea of adding a minGraphemes and maxGraphemes and graphemes action using Intl.Segmenter to count emojis. We can work on it after merging this PR.

Would you rename emoji to emojis to reflect that multiple emoji are allowed? Should we update the unit tests of emoji to reflect the changes in this PR?

slevithan · 2024-07-14T17:12:25Z

Would you rename emoji to emojis to reflect that multiple emoji are allowed?

The word emoji is already both singular and plural, so the current name is good.

Should we update the unit tests of emoji to reflect the changes in this PR?

Done.

fabian-hiller · 2024-07-14T18:08:54Z

This PR looks good to me now. Can I merge it?

slevithan · 2024-07-16T14:14:47Z

LGTM!

fabian-hiller · 2024-07-30T14:40:12Z

v0.37.0 is available

Movsar-Khalakhoev · 2024-08-17T20:46:01Z

Due to changes in EMOJI_REGEX, version 0.37.0 does not work on React Native.
Hermes had problems working with regular expressions. Now the problem is fixed, but the update has not yet reached the RN

fabian-hiller · 2024-08-18T11:29:08Z

What do you recommend? Wait for the fix to be released or make changes to Valibot?

slevithan · 2024-08-18T14:18:40Z

@Movsar-Khalakhoev following a chain of issues from the one you linked to, I see that Hermes' previous lack of support for Unicode properties was leading to Zod not working with it due to the same regex that Valibot was using prior to 0.37.0. See: StefanTerdell/zod-to-json-schema#129

So I'm not sure why the previous version of Valibot would have worked with Hermes if this is the issue. It might be helpful to provide more details about the issue you're seeing.

slevithan · 2024-08-19T14:37:52Z

Also, although I haven't verified that the Hermes/RN fix is working, it looks like the Hermes fix (facebook/hermes#1295) was included in Hermes 0.13.0 a few days ago, which was included in the React Native 0.75.1 release.

Improve regexes

66dadad

fabian-hiller self-assigned this Jun 20, 2024

fabian-hiller added the enhancement New feature or request label Jun 20, 2024

slevithan mentioned this pull request Jun 21, 2024

Add short regex pattern compatible with ES2018 that matches whatever emoji are supported natively mathiasbynens/rgi-emoji-regex-pattern#3

Closed

fabian-hiller force-pushed the main branch from 3975044 to 1b3c4d9 Compare July 2, 2024 13:06

Update emoji regex and revert IP regex changes

026d86f

Update comment about RGI_Emoji regex

126b100

fabian-hiller added the fix A smaller enhancement or bug fix label Jul 7, 2024

slevithan mentioned this pull request Jul 7, 2024

Improve DX with babel-plugin-transform-regex #704

Open

Update emoji tests

93954de

fabian-hiller and others added 3 commits July 14, 2024 19:59

Update JSDoc of EMOJI_REGEX and remove unused code

5b5fa0c

Merge branch 'main' into regex

84a61a1

Update changelog of library

efee092

Merge branch 'main' into regex

d6c8700

fabian-hiller merged commit 77df686 into fabian-hiller:main Jul 16, 2024
6 checks passed

fabian-hiller mentioned this pull request Aug 15, 2024

Add length validators based on the Intl.Segmenter API #784

Closed

This was referenced Sep 19, 2024

[Snyk] Upgrade: , mime, valibot DevangPatelUK/BotFramework-WebChat#343

Open

[Snyk] Upgrade: , mime, valibot DevangPatelUK/BotFramework-WebChat#364

Open

[Snyk] Upgrade valibot from 0.30.0 to 0.41.0 DevangPatelUK/BotFramework-WebChat#374

Open

sdcs26 mentioned this pull request Oct 3, 2024

[Snyk] Upgrade valibot from 0.30.0 to 0.41.0 sdcs26/Sigim-React-master-front-#2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve regexes #666

Improve regexes #666

slevithan commented Jun 20, 2024

fabian-hiller commented Jun 20, 2024

slevithan commented Jun 20, 2024 •

edited

Loading

fabian-hiller commented Jun 20, 2024

slevithan commented Jun 28, 2024

fabian-hiller commented Jun 28, 2024

slevithan commented Jul 2, 2024

slevithan commented Jul 3, 2024

fabian-hiller commented Jul 5, 2024

slevithan commented Jul 6, 2024 •

edited

Loading

fabian-hiller commented Jul 6, 2024

slevithan commented Jul 6, 2024 •

edited

Loading

fabian-hiller commented Jul 7, 2024

slevithan commented Jul 7, 2024 •

edited

Loading

slevithan commented Jul 10, 2024

fabian-hiller commented Jul 14, 2024

slevithan commented Jul 14, 2024

fabian-hiller commented Jul 14, 2024

slevithan commented Jul 16, 2024

fabian-hiller commented Jul 30, 2024

Movsar-Khalakhoev commented Aug 17, 2024

fabian-hiller commented Aug 18, 2024

slevithan commented Aug 18, 2024

slevithan commented Aug 19, 2024 •

edited

Loading

Improve regexes #666

Improve regexes #666

Conversation

slevithan commented Jun 20, 2024

fabian-hiller commented Jun 20, 2024

slevithan commented Jun 20, 2024 • edited Loading

fabian-hiller commented Jun 20, 2024

slevithan commented Jun 28, 2024

fabian-hiller commented Jun 28, 2024

slevithan commented Jul 2, 2024

slevithan commented Jul 3, 2024

fabian-hiller commented Jul 5, 2024

slevithan commented Jul 6, 2024 • edited Loading

On emoji-regex-xs

Problems with Valibot's current emoji regex

fabian-hiller commented Jul 6, 2024

slevithan commented Jul 6, 2024 • edited Loading

fabian-hiller commented Jul 7, 2024

slevithan commented Jul 7, 2024 • edited Loading

slevithan commented Jul 10, 2024

fabian-hiller commented Jul 14, 2024

slevithan commented Jul 14, 2024

fabian-hiller commented Jul 14, 2024

slevithan commented Jul 16, 2024

fabian-hiller commented Jul 30, 2024

Movsar-Khalakhoev commented Aug 17, 2024

fabian-hiller commented Aug 18, 2024

slevithan commented Aug 18, 2024

slevithan commented Aug 19, 2024 • edited Loading

slevithan commented Jun 20, 2024 •

edited

Loading

slevithan commented Jul 6, 2024 •

edited

Loading

slevithan commented Jul 6, 2024 •

edited

Loading

slevithan commented Jul 7, 2024 •

edited

Loading

slevithan commented Aug 19, 2024 •

edited

Loading