Info regarding future and correctness of the project #13
Comments
Hi @rsalmei Sorry about the delay, thanks for the questions.
Yes, I plan to. Upgrading Unicode versions is a 15-minute job for me, assuming they don't change anything significant in the grapheme/boundary annex. I've done it for a few years and enjoyed it so far. This is a small open source project, so anything can happen with it, of course. If I disappear, the repo describes how to upgrade the Unicode version if one wants to fork it or vendor it.
No C extensions in this package, and currently no dependencies. I was exploring a Cython extension to speed it up, but I wouldn't want to make it required. There is https://pypi.org/project/PyICU/ for those preferring C performance. There might be space for an intermediate solution which uses C extensions but doesn't depend on a non-pip dependency (ICU) being installed; I'm not sure this project is it, though. It probably won't happen.
Interesting. This project runs and passes all tests in https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakTest.txt, which is Unicode's reference test for grapheme boundaries. Per Unicode Standard Annex #29, the breaking here is "correct". This library implements the breaking rules for Extended Grapheme Clusters (http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table). The skin tone modifiers are classified as "Extend" code points, while the ASCII chars are "Other". According to the "3.1.1 Grapheme Cluster Boundary Rules" section, we should never break before "Extend" code points; they always extend the previous code point. Generally, broken sequences where code points are not combined in expected ways may cause all sorts of weirdness in any text. I'm not sure how you get to the expectation that they should be separate graphemes based on the emoji-test data, could you elaborate?
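To make the "never break before Extend" rule concrete, here is a minimal sketch in plain Python. It is not this library's implementation: `is_extend` and `simple_clusters` are hypothetical helpers, and the Extend check is a rough approximation that only covers the skin tone modifiers (U+1F3FB..U+1F3FF) and combining marks. All other UAX #29 rules (ZWJ sequences, regional indicators, Hangul, controls) are ignored.

```python
import unicodedata

# Emoji modifiers (Fitzpatrick skin tones) carry Grapheme_Cluster_Break=Extend.
SKIN_TONES = range(0x1F3FB, 0x1F400)  # U+1F3FB..U+1F3FF


def is_extend(ch):
    """Rough stand-in for Grapheme_Cluster_Break=Extend.

    Assumption: only skin tone modifiers and Mn/Me combining marks are
    treated as Extend; the real property covers more code points.
    """
    return ord(ch) in SKIN_TONES or unicodedata.category(ch) in ("Mn", "Me")


def simple_clusters(text):
    """Split text into clusters, never breaking before an Extend code point."""
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch  # attach to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


tone_a, tone_b = "\U0001F3FB", "\U0001F3FD"
print(len(simple_clusters(tone_a + tone_b)))     # 1: second modifier extends the first
print(len(simple_clusters("a" + tone_b + "b")))  # 2: modifier attaches to "a"
print(len(simple_clusters("ab" + tone_b)))       # 2: modifier attaches to "b"
```

Under this rule a skin tone modifier attaches to whatever precedes it, whether that is a human emoji, an ASCII letter, or another modifier.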
That's very nice, I'd already committed to your lib!
Yeah, of course!
But what exactly is "using a skin modifier alone"? I've assumed it is when not being preceded by a human emoji.
So it does seem that "any other intervening character" should make the skin modifier appear as a free-standing character.
Unicode and text representation will probably always be a mystery :) I'm not sure how one is supposed to represent a standalone skin tone modifier, especially if you first include an emoji and a following modifier and then a sequence of standalone modifiers. In the example of an intervening character you mentioned, they seem to be using U+200B (zero width space) to separate an emoji from the skin tone. This does boil down to a limitation of this library and of grapheme clusters in general, though: how a text is rendered is ultimately up to the text rendering entity to decide and implement, including the font designer. Grapheme clusters may be a good approximation of what a human would perceive as a textual entity given a well-implemented and up-to-date font and renderer, but they are not guaranteed to match. The specification in annex #29 is a simplification, and there will be cases where some combination of code points is rendered into different groups than one might expect, even with grapheme awareness. Off the top of my head:
Closing this, feels resolved
Yeah, thank you @alvinlindstam
Hey man, I'm the author of alive-progress. I'm struggling to correctly support emojis in rsalmei/alive-progress#19, and I think this project could help me.
... cython folder with a few .c files in my site-packages ...
...emoji-test.txt from unicode.org, and while testing several combinations of emojis, yours has only failed on the Fitzpatrick skin tone modifiers when used alone (but the Unicode spec states that they should be treated as a normal emoji when used alone). My brute force validation ensures all chars described in that file are detected, even when concatenated with other chars. You can see in the image that it fails where: 1. two skin tones are used one after the other (I expected two graphemes, not one); 2. an ASCII char is followed by a skin tone and another ASCII char (I expected three graphemes, not the skin tone attached to the first ASCII char); and 3. two ASCII chars are followed by a skin tone (same as 2.).
But it is ok, it works in the vast majority (and the regex dependency demonstrated the same results).
So, I'm thinking now about how to continue my wide chars/emoji support:
Thank you man!