Grapheme cluster (\X) selector capturing multiple characters #410

Closed
Ayesh opened this issue Jun 2, 2024 · 2 comments

Ayesh commented Jun 2, 2024

Using PCRE2 10.43, the \X selector seems to capture more than one grapheme cluster at a time, as if it does not break before the start of a new grapheme cluster.

Regex: \X
Input: 🏳️‍🌈🏴‍☠️ (U+1F3F3 U+FE0F U+200D U+1F308 + U+1F3F4 U+200D U+2620 U+FE0F)

When run, a single \X match covers both flag graphemes (see the Regex101 preview).
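
To make the input reproducible, the string can be rebuilt from the code points listed above. This is only a small Python sketch; the variable names are illustrative and not part of the report.

```python
import unicodedata

# Code points exactly as listed in the report: two ZWJ flag sequences back to back.
code_points = [0x1F3F3, 0xFE0F, 0x200D, 0x1F308,   # rainbow flag
               0x1F3F4, 0x200D, 0x2620, 0xFE0F]    # pirate flag

text = "".join(chr(cp) for cp in code_points)

# Print each code point with its Unicode character name.
for cp in code_points:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

print(len(text))                  # 8 code points
print(len(text.encode("utf-8")))  # 27 bytes in UTF-8
```

Eight code points should form two grapheme clusters of four code points each, so a correct \X is expected to match twice.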

Could you kindly shed some light on whether I'm missing something?

Thank you.

PhilipHazel (Collaborator) commented

This is a bug, caused by my misreading or misunderstanding one of the rules in Unicode Annex 29, way back when I implemented \X. I'm a bit surprised it's taken so long for it to hit anybody. Furthermore, the documentation correctly describes what the code does, but it's not what it's supposed to do! (Somewhere I even noted a difference from Perl, but never investigated.) I hope to have this fixed in HEAD in the next day or two. This is a very timely issue because the 10.44 release will be forthcoming once this fix is done. Thanks for the report.
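
For readers following along, the boundary behaviour expected for the reported input can be sketched in a few lines of Python. This is a minimal illustration and not PCRE2's implementation: it hard-codes the properties of only the code points from the report and applies just three rules from Unicode Annex 29 (GB9, GB11 and the default break GB999). The names `EXTEND`, `EXT_PICT` and `split_clusters` are made up for the example.

```python
# Hand-written property table covering ONLY the code points in the report
# (an assumption for illustration; real code would use the Unicode data files).
EXTEND = {0xFE0F}                                # VARIATION SELECTOR-16
ZWJ = 0x200D                                     # ZERO WIDTH JOINER
EXT_PICT = {0x1F3F3, 0x1F308, 0x1F3F4, 0x2620}   # Extended_Pictographic


def split_clusters(text):
    """Split text into grapheme clusters using GB9, GB11 and GB999 only."""
    clusters = []
    current = ""
    after_pict = False  # current ends with: ExtPict Extend*
    joinable = False    # current ends with: ExtPict Extend* ZWJ
    for ch in text:
        cp = ord(ch)
        if not current:
            keep = True                      # first character of the text
        elif cp in EXTEND or cp == ZWJ:
            keep = True                      # GB9: no break before Extend or ZWJ
        elif cp in EXT_PICT and joinable:
            keep = True                      # GB11: ExtPict Extend* ZWJ x ExtPict
        else:
            keep = False                     # GB999: otherwise, break
        if not keep:
            clusters.append(current)
            current = ""
        current += ch
        # Update the state that GB11 depends on.
        if cp in EXT_PICT:
            after_pict, joinable = True, False
        elif cp in EXTEND:
            joinable = False                 # after_pict is left unchanged
        elif cp == ZWJ:
            after_pict, joinable = False, after_pict
        else:
            after_pict = joinable = False
    if current:
        clusters.append(current)
    return clusters


text = "\U0001F3F3\uFE0F\u200D\U0001F308\U0001F3F4\u200D\u2620\uFE0F"
clusters = split_clusters(text)
print([len(c) for c in clusters])  # [4, 4]
```

The point of the sketch is that GB11 only suppresses a break when a ZWJ sits directly before the pictographic character, so the boundary between U+1F308 and U+1F3F4 remains a break.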

Ayesh commented Jun 5, 2024

Thank you. I tested after commit 067c2f1, and it worked correctly!

Ayesh closed this as completed Jun 5, 2024
Ayesh added a commit to Ayesh/polyfill that referenced this issue Jun 5, 2024
Add a polyfill for the `grapheme_str_split` function added in PHP 8.4.

Requires PHP 7.3, because the polyfill is based on `\X` Regex, and it
only works properly on PCRE2, which
[only comes with PHP 7.3+](https://php.watch/versions/7.3/pcre2).

Further, there are some cases that the polyfill cannot split complex
characters (such as two consecutive country flag Emojis). This is now
fixed in [PCRE2Project/pcre2#410](PCRE2Project/pcre2#410).
However, this change will likely only make it to PHP 8.4.

References:
 - [RFC: Grapheme cluster for `str_split` function: `grapheme_str_split`](https://wiki.php.net/rfc/grapheme_str_split)
 - [PHP.Watch: PHP 8.4: New `grapheme_str_split` function](https://php.watch/versions/8.4/grapheme_str_split)
Ayesh added a commit to Ayesh/php-src that referenced this issue Jun 7, 2024
Previously: phpGH-13413.

This version also contains a fix with `preg_match('\X')`, so that it
can correctly detect grapheme clusters (PCRE2Project/pcre2#410).
This is useful to correctly [polyfill the new `grapheme_str_split`
function](https://php.watch/versions/8.4/grapheme_str_split#polyfill).

Diff: pcre2lib [v10.43..v10.44](PCRE2Project/pcre2@pcre2-10.43...pcre2-10.44)
nielsdos pushed a commit to php/php-src that referenced this issue Jun 8, 2024
Previously: GH-13413.

This version also contains a fix with `preg_match('\X')`, so that it
can correctly detect grapheme clusters (PCRE2Project/pcre2#410).
This is useful to correctly [polyfill the new `grapheme_str_split`
function](https://php.watch/versions/8.4/grapheme_str_split#polyfill).

Diff: pcre2lib [v10.43..v10.44](PCRE2Project/pcre2@pcre2-10.43...pcre2-10.44)