Improve regexes #666
Thank you!
What do you recommend?
I had implemented this the same way before, but some bundlers had trouble tree-shaking it and included the regex in every bundle, whether it was used or not. So I repeated it. Due to compression, it is probably also better in terms of bundle size. |
I recommend using it as written, with
Fair enough, if bundle size trumps readability. Another solution would be to use the |
Thank you for your feedback! It may take some time to finish this PR as other things have a higher priority at the moment. |
I've done a lot more research on the emoji regex and published an improved/fixed version as |
Thank you very much! I will probably review and merge this PR next week. |
OK, updated. Passes tests and lint. |
In the last-added commit, I updated the comment about To clarify what the emoji regex in this PR matches:
On that last point, unfortunately, some common-sense and broadly-supported emoji are not officially in the "RGI" list. Even the Unicode org provides emoji-test.txt that mixes in non-RGI emoji strings to help identify real-world emoji. And some emoji are commonly used in an underqualified or overqualified way (by including or excluding certain invisible Unicode markers) that prevents them from being matched by The regex here allows overqualified and underqualified emoji using a general pattern that matches all Unicode sequences that follow the structure of valid emoji. |
Thank you for your research! Is the new emoji regex more strict or accurate than the old one? |
Yes. If by "the old one" you meant my own iterations, this is the final version based on research in depth, and I've now therefore also published it as its own library (emoji-regex-xs). If there are changes to emoji-regex-xs in the future (e.g., if new versions of Unicode modify the general patterns for emoji), it can easily be updated by anyone, by simply copying the pattern from future versions of emoji-regex-xs and wrapping it in
On emoji-regex-xs
This library shares the API and 3,000+ tests with emoji-regex. emoji-regex is very large (13 KB uncompressed), but it is authoritative (its author helped add things like
Problems with Valibot's current emoji regex
If by "the old one" you meant the regex currently used in the code that this PR replaces, then yes, this PR is both more strict and more accurate. Here is the current regex that is being replaced:
/^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u
This is extremely wrong. I'm assuming you picked it up from Zod, which uses the same thing, and I can find other people posting it online, which is where Zod probably got it from. It presumably has spread virally because:
There are two big problems with the regex that this PR replaces:
Regarding the first problem, I already mentioned some of its false positives in earlier comments, but here are some additional details (not comprehensive):
The emoji regex in this PR fixes all of these issues. |
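To make the false-positive problem concrete, here is a minimal sketch that runs the regex being replaced (as quoted above) against a few strings that are not emoji. The sample inputs and variable names are assumptions chosen for illustration and are not taken from the PR. They rely on the fact that the Unicode Emoji_Component property includes ASCII digits, `#`, `*`, and U+200D.

```javascript
// The regex this PR replaces, as quoted in the comment above.
const oldEmojiRegex = /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u;

// Illustrative inputs (assumed, not from the PR): none of these is an emoji.
// '\u200D' is a bare Zero-Width Joiner.
const nonEmoji = ['123', '#', '*', '\u200D'];

// Each input is accepted because every character has Emoji_Component.
const results = nonEmoji.map((s) => oldEmojiRegex.test(s));
console.log(results); // → [ true, true, true, true ]

// Ordinary letters are still rejected, which may be why the pattern
// looks plausible in casual testing.
console.log(oldEmojiRegex.test('abc')); // → false
```

This only demonstrates the over-matching side of the old pattern. It does not exercise the replacement regex from this PR.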
Thanks again for your research and detailed answer! I thought it would be the best DX if |
I'm not familiar with Valibot's APIs. Could Certainly, with the new emoji regex, something like this could easily be done. You'd just need to change the But this seems like something for a follow-up issue. I'd prefer to land this PR as is and for new functionality to be added afterward. PS: The labels for this PR should include |
I agree.
Any ideas on how to implement this? Maybe we could add something like a |
Okay, I looked at valibot.dev/api/string/ and valibot.dev/api/emoji/ to better understand what you're referring to. I agree that The term character is very overloaded so I'd advise against using A concrete example is the emoji '👩🏻‍🏫'.
// Code unit length
'👩🏻🏫'.length;
// → 7
// Each astral code point (above U+FFFF) is divided into high and low surrogates
// Code point length
[...'👩🏻🏫'].length;
// → 4
// These are: U+1F469 U+1F3FB U+200D U+1F3EB
// (U+1F469 U+1F3FB) is '👩🏻', U+200D is a Zero-Width Joiner, and U+1F3EB is '🏫'
// Grapheme cluster length
[...new Intl.Segmenter().segment('👩🏻🏫')].length;
// → 1
Since I think |
Edited my last comment to use '👩🏻🏫' (a grapheme cluster with two graphemes) instead of '👩🏻👩🏻👦🏻👦🏻' (a grapheme cluster with four graphemes), since the latter currently only renders as a single user-perceived character on Microsoft and Facebook platforms. The emoji sets from Apple and Google don't include a unique design for it, so at least on iOS it renders as four discrete emoji characters, while the cluster is still selected as a single character. |
Sorry for the late reply. I took a week off. Thanks for your research. I like the idea of adding a Would you rename |
The word emoji is already both singular and plural, so the current name is good.
Done. |
This PR looks good to me now. Can I merge it? |
LGTM! |
v0.37.0 is available |
Due to changes in |
What do you recommend? Wait for the fix to be released or make changes to Valibot? |
@Movsar-Khalakhoev following a chain of issues from the one you linked to, I see that Hermes' previous lack of support for Unicode properties was leading to Zod not working with it due to the same regex that Valibot was using prior to 0.37.0. See: StefanTerdell/zod-to-json-schema#129 So I'm not sure why the previous version of Valibot would have worked with Hermes if this is the issue. It might be helpful to provide more details about the issue you're seeing. |
Also, although I haven't verified that the Hermes/RN fix is working, it looks like the Hermes fix (facebook/hermes#1295) was included in Hermes 0.13.0 a few days ago, which was included in the React Native 0.75.1 release. |
I didn't audit all the regexes. Just the specific ones described below.
EMOJI_REGEX
0, etc.), *, bare U+200D (ZWJ), and some symbols like 👁, ✈, 🏳, and ♂ even when they're not followed by U+FE0F (none of which should match). So I fixed it.
\p{Me}, or if it's a more limited set like just U+20E3. The segment in question (\p{Emoji}\uFE0F\p{Me}?) is used to match emoji like 2️⃣ which is made up of 3 code points: U+0032 U+FE0F U+20E3.
HEXADECIMAL_REGEX
Replaced 0h|0x with 0[hx].
IPV4_REGEX
Replaced (?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5]) with (?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d). IMO it's easier to read without the nested grouping, and it's the same length. Then replaced \d\d with \d{2} to work around an eslint error that I don't agree with.
IPV6_REGEX
IP_REGEX
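As a quick sanity check on the IPV4_REGEX change described above, the sketch below compares the old and new octet patterns over a range of inputs. The two patterns are the ones quoted in the description. The `^...$` anchors, the variable names, and the choice of test inputs are assumptions added here so each pattern can be tested on a single octet in isolation.

```javascript
// Octet patterns quoted in the PR description, anchored for standalone testing.
const oldOctet = /^(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])$/;
const newOctet = /^(?:2[0-4]\d|25[0-5]|1\d{2}|[1-9]?\d)$/;

// Candidate strings: "0".."999" plus zero-padded forms such as "07" and "007".
const candidates = [];
for (let i = 0; i < 1000; i++) {
  const s = String(i);
  candidates.push(s, '0' + s, '00' + s);
}

// Inputs on which the two patterns disagree (expected: none).
const mismatches = candidates.filter((s) => oldOctet.test(s) !== newOctet.test(s));

// Inputs accepted by the new pattern (expected: exactly "0".."255").
const accepted = candidates.filter((s) => newOctet.test(s));

console.log(mismatches.length); // → 0
console.log(accepted.length); // → 256
```

Over this input set, both patterns accept the same strings, which is consistent with the change being a readability refactor rather than a behavior change.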