Info regarding future and correctness of the project #13

Closed
rsalmei opened this issue Sep 29, 2020 · 8 comments

@rsalmei

rsalmei commented Sep 29, 2020

Hey man, I'm the author of alive-progress. I'm struggling to correctly support emojis in rsalmei/alive-progress#19, and I think this project could help me.

  • Please, do you really intend to keep updating this project? For every new Unicode version?
  • Performance doesn't really matter to me, since I've implemented a spinner compiler just for this, but yours seems to be fast anyway. It does not use any binary extension, does it? I'm asking because there's a cython folder with a few .c files in my site-packages...
  • I'm only interested in correctness, and this one seems very nice. I've created a brute-force test using the emoji-test.txt from unicode.org, and while testing several combinations of emojis, yours only failed on the Fitzpatrick skin tone modifiers when used alone (but the Unicode spec states that they should be treated as a normal emoji when used alone):

image

My brute-force validation ensures all chars described in that file are detected, even when concatenated with other chars. You can see in the image that it fails where: 1. two skin tones are used one after the other (I expected two graphemes, not one); 2. an ASCII char is followed by a skin tone and another ASCII char (I expected three graphemes, not the skin tone attached to the ASCII char); and 3. two ASCII chars are followed by a skin tone (same as 2).
But that's OK, it works in the vast majority of cases (and the regex dependency showed the same results).
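
For readers who want to reproduce a check like this, here is a minimal sketch of reading emoji-test.txt and building the kinds of strings described above. It is not the actual test used in this issue: `SAMPLE`, `parse_emoji_test` and `candidates` are hypothetical names, and the sample lines only imitate the layout of the real file so that the sketch runs offline.

```python
from typing import Iterator, List, Tuple

# A few lines in the emoji-test.txt layout (illustrative, inlined for offline use).
SAMPLE = """\
# group: Smileys & Emotion
1F600                                      ; fully-qualified     # E1.0 grinning face
1F44B 1F3FD                                ; fully-qualified     # E1.0 waving hand: medium skin tone
1F3FB                                      ; component           # E1.0 light skin tone
"""


def parse_emoji_test(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (emoji_string, status) for every data line, skipping comments."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if ";" not in data:
            continue  # blank or comment-only line
        code_points, status = (part.strip() for part in data.split(";", 1))
        emoji = "".join(chr(int(cp, 16)) for cp in code_points.split())
        yield emoji, status


def candidates(emoji: str) -> List[str]:
    """Build the combinations discussed above: alone, doubled, and mixed with ASCII."""
    return [emoji, emoji + emoji, "a" + emoji + "b", "ab" + emoji]
```

Each candidate string can then be fed to whichever segmenter is being evaluated and the resulting cluster count compared against the expectation under test.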


So, I'm thinking now about how to continue my wide chars/emoji support:

  • include your project as a dependency;
  • include regex as a dependency (but it does have a binary extension, so I'm not willing to);
  • implement my own regexp to detect graphemes (here I would not need actual sequence validation, just the few formats, but it's not that easy anyway).

Thank you man!

@rsalmei rsalmei changed the title Info regarding project Info regarding future and correctness of the project Sep 30, 2020
@alvinlindstam
Owner

Hi @rsalmei

Sorry about the delay, thanks for the questions.

Please, do you really intend to keep updating this project? For every new Unicode version?

Yes, I plan to. Upgrading unicode versions is a 15 minute job for me, assuming they don't change anything significant in the grapheme/boundary annex. I've done it for a few years and enjoyed it so far. This is a small open source project, so anything can happen with it, of course. If I disappear, the repo describes how to upgrade the unicode version if one wants to fork it or vendor it.

@alvinlindstam
Owner

Performance doesn't really matter to me, since I've implemented a spinner compiler just for this, but yours seems to be fast anyway. It does not use any binary extension, does it? I'm asking because there's a cython folder with a few .c files in my site-packages...

No C extensions in this package, and currently no dependencies. I was exploring some cython extension for this to speed it up, but I wouldn't want to make it required.

There is https://pypi.org/project/PyICU/ for those preferring C performance. There might be room for an intermediate solution which uses C extensions but doesn't depend on a non-pip dependency (ICU) being installed; I'm not sure if it is this project though. It probably won't happen.

@alvinlindstam
Owner

I'm only interested in correctness, and this one seems very nice. I've created a brute-force test using the emoji-test.txt from unicode.org, and while testing several combinations of emojis, yours only failed on the Fitzpatrick skin tone modifiers when used alone (but the Unicode spec states that they should be treated as a normal emoji when used alone)
...
My brute-force validation ensures all chars described in that file are detected, even when concatenated with other chars. You can see in the image that it fails where: 1. two skin tones are used one after the other (I expected two graphemes, not one); 2. an ASCII char is followed by a skin tone and another ASCII char (I expected three graphemes, not the skin tone attached to the ASCII char); and 3. two ASCII chars are followed by a skin tone (same as 2).
But that's OK, it works in the vast majority of cases (and the regex dependency showed the same results).

Interesting. This project runs and passes all tests in https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakTest.txt, which is Unicode's reference test data for grapheme boundaries. As per Unicode Standard Annex #29, the breaking here is "correct".
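
As an illustration of how that test data can be consumed, here is a minimal sketch, not this project's actual test code. In GraphemeBreakTest.txt, `÷` marks a boundary and `×` marks a position where no break is allowed; `expected_clusters` and `passes` are hypothetical helper names, and `SAMPLE` is an inlined line in the file's layout.

```python
from typing import Callable, Iterable, List

BREAK, NO_BREAK = "\u00f7", "\u00d7"  # the ÷ and × markers used by the test file

# One line in the GraphemeBreakTest.txt layout (inlined for offline use).
SAMPLE = "\u00f7 0020 \u00d7 0308 \u00f7 0020 \u00f7\t# illustrative comment"


def expected_clusters(line: str) -> List[str]:
    """Turn one test line into the list of clusters it says should be produced."""
    tokens = line.split("#", 1)[0].split()  # ignore the trailing comment
    clusters: List[str] = []
    current = ""
    for token in tokens:
        if token == BREAK:
            if current:
                clusters.append(current)  # a boundary closes the current cluster
            current = ""
        elif token != NO_BREAK:
            current += chr(int(token, 16))  # a hex code point joins the cluster
    if current:
        clusters.append(current)
    return clusters


def passes(line: str, segment: Callable[[str], Iterable[str]]) -> bool:
    """Check any segmenter (a callable yielding clusters) against one test line."""
    expected = expected_clusters(line)
    return list(segment("".join(expected))) == expected
```

With the library installed, something like `passes(line, grapheme.graphemes)` could be run over every non-comment line of the file.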

This library implements the breaking rules for Extended Grapheme Clusters (http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table). The skin tone modifiers are classified as "Extend" code points, while the ASCII chars are "Other". According to the "3.1.1 Grapheme Cluster Boundary Rules" section, we should never break before "Extend" code points (rule GB9); they always extend the previous code point.
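
To make the effect of that rule concrete, here is a deliberately simplified sketch that applies only the "never break before Extend" rule. It is not this library's implementation: `is_extend` is a rough stand-in (an assumption) that treats the five skin tone modifiers and the Mn/Me general categories as Extend instead of consulting the real property table, and every other rule in the annex is ignored.

```python
import unicodedata
from typing import List

SKIN_TONES = range(0x1F3FB, 0x1F400)  # U+1F3FB..U+1F3FF, the emoji modifiers


def is_extend(ch: str) -> bool:
    """Rough approximation of Grapheme_Cluster_Break=Extend (assumption)."""
    return ord(ch) in SKIN_TONES or unicodedata.category(ch) in ("Mn", "Me")


def split_never_before_extend(text: str) -> List[str]:
    """Break between all code points, except before an Extend code point."""
    clusters: List[str] = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch  # Extend attaches to whatever precedes it
        else:
            clusters.append(ch)  # start of text or a non-Extend code point
    return clusters


tone = "\U0001F3FD"  # medium skin tone modifier
print(len(split_never_before_extend(tone + tone)))       # two modifiers in a row
print(len(split_never_before_extend("a" + tone + "b")))  # ASCII, modifier, ASCII
print(len(split_never_before_extend("ab" + tone)))       # two ASCII, then modifier
```

Under this rule the three cases from the report come out as one, two and two clusters: a modifier at the start of the text begins a cluster of its own, and any modifier after that attaches to the preceding code point, whether it is another modifier or an ASCII char.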

Generally, broken sequences where codepoints are not combined in expected ways may cause all sorts of weirdness in any text.

I'm not sure how you get to the expectation that they should be separate graphemes based on the emoji-test data, could you elaborate?

@rsalmei
Author

rsalmei commented Jan 5, 2021

Yes, I plan to. Upgrading unicode versions is a 15 minute job for me, assuming they don't change anything significant in the grapheme/boundary annex. I've done it for a few years and enjoyed it so far. This is a small open source project, so anything can happen with it, of course. If I disappear, the repo describes how to upgrade the unicode version if one wants to fork it or vendor it.

That's very nice, I've already committed to your lib!
My next major version of alive-progress will include it 👍

@rsalmei
Author

rsalmei commented Jan 5, 2021

I'm not sure how you get to the expectation that they should be separate graphemes based on the emoji-test data, could you elaborate?

Yeah, of course!
I've found this info on http://unicode.org/reports/tr51/, item 2.4 Diversity:

When used alone, the default representation of these modifier characters is a color swatch.

But what exactly is "using a skin modifier alone"? I've assumed it means not being preceded by a human emoji.
The text then goes on to explain this in more detail, and ends that section with:

Any other intervening character causes the emoji modifier to appear as a free-standing character. Thus
image

So it does seem that "any other intervening character" should make the skin modifier appear as a free-standing character.
What do you think? Am I right to infer that two adjacent skin tones, and an ASCII char followed by a skin tone, should both be rendered split?

@alvinlindstam
Owner

Unicode and text representation will probably always be a mystery :)

I'm not sure how one is supposed to represent a standalone skin tone modifier, especially if you first include an emoji and a following modifier and then a sequence of standalone modifiers. In the example of intervening character you mentioned, they seem to be using U+200B (zero width space) to separate an emoji from the skin tone.

This does boil down to a limitation of this library and of grapheme clusters in general though; how a text is rendered is eventually up to the text rendering entity to decide and implement, including the font designer. Grapheme clusters may be a good approximation of what a human would perceive as a textual entity given a well-implemented and up-to-date font and renderer, but they are not guaranteed to match. The specification in Annex #29 is a simplification, and there will be cases where some combination of code points is rendered into different groups than one might expect, even with grapheme awareness.

Off the top of my head:

  1. Emoji sequences not implemented by the vendor.
  2. Non-standard emoji sequences (I think Microsoft, for example, allows different skin tone modifiers on individuals in a family emoji sequence) rendered on other, non-supporting platforms.
  3. Differences in how the width of zero-width items (like U+200D) is displayed in monospaced contexts. Some renderers display them with normal width, some don't.
  4. Regional indicator sequences (national flags) for country codes that don't exist. According to Annex #29, any pair of consecutive regional indicators should be considered a grapheme cluster, but a renderer should only render it as a single entity if the font implements that flag.
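
The regional indicator case in item 4 can be sketched as follows. This is a simplified illustration that handles only the pairing of regional indicators and ignores every other boundary rule, not this library's code; `pair_regional_indicators` is a hypothetical name, and the "XX" pair is used on the assumption that it is not assigned to any country.

```python
from typing import List


def is_regional_indicator(ch: str) -> bool:
    """Regional indicator symbols occupy U+1F1E6..U+1F1FF (letters A to Z)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def pair_regional_indicators(text: str) -> List[str]:
    """Group consecutive regional indicators two by two, whatever they spell."""
    clusters: List[str] = []
    run_length = 0  # regional indicators seen so far in the current run
    for ch in text:
        if is_regional_indicator(ch):
            if run_length % 2 == 1:
                clusters[-1] += ch  # second half of a pair joins the first
            else:
                clusters.append(ch)  # first half of a new pair
            run_length += 1
        else:
            clusters.append(ch)
            run_length = 0
    return clusters


SE = "\U0001F1F8\U0001F1EA"  # regional indicators S + E
XX = "\U0001F1FD\U0001F1FD"  # regional indicators X + X (assumed unassigned)
print(len(pair_regional_indicators(SE)), len(pair_regional_indicators(XX)))
```

Both pairs count as a single cluster here, even though a font would typically draw only the first as a flag, which is exactly the mismatch between segmentation and rendering described in the list.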

@alvinlindstam
Owner

Closing this, feels resolved

@rsalmei
Author

rsalmei commented Jan 8, 2021

Yeah, thank you @alvinlindstam
