
Emoji: Skin tone modifiers used in isolation #133

Open
janlelis opened this issue Nov 13, 2024 · 1 comment

Comments


janlelis commented Nov 13, 2024

Hi Jeff,

In the wcwidth specification, you list skin tone modifiers as zero-width. However, this is not always true: they should be displayed as emoji (2 columns) when used in isolation, that is, when not part of a (known) emoji sequence:

https://www.unicode.org/reports/tr51/#def_basic_emoji_set

Related: #134
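
For reference, the Unicode properties behind this can be checked with Python's standard `unicodedata` module. This is only a sketch of the property lookup, not of how wcwidth classifies these code points. The five Fitzpatrick modifiers (U+1F3FB to U+1F3FF) carry general category `Sk` and East Asian Width `W`, which is consistent with a 2-column display when one appears on its own. The helper name `describe` is made up for this example.

```python
import unicodedata

# The five Fitzpatrick skin tone modifiers, U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]


def describe(ch):
    """Return (general category, East Asian Width) for one character."""
    return unicodedata.category(ch), unicodedata.east_asian_width(ch)


for ch in SKIN_TONE_MODIFIERS:
    print(f"U+{ord(ch):04X}", *describe(ch))
```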

Owner

jquast commented Nov 13, 2024

Yes. There are combining characters that also have this exact same behavior.

This is difficult because it would mean having a definition such as "0 width when in sequence with [... large list ...], width 2 otherwise", and tracking every likely combination.
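
The kind of context-dependent rule described above could look roughly like the following. This is a minimal sketch using only the standard library, not wcwidth's implementation and not the trial mentioned in this thread. `MODIFIER_BASES` is a deliberately tiny, hypothetical subset of the Emoji_Modifier_Base code points, and `simple_width` and `context_width` are names invented for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny subset of Emoji_Modifier_Base code points;
# a real implementation would need the full list from the Unicode emoji data.
MODIFIER_BASES = {0x261D, 0x270B, 0x1F44B, 0x1F44D}


def is_skin_tone_modifier(ch):
    """True for the Fitzpatrick modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def simple_width(ch):
    """Context-free width: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def context_width(text):
    """Sum widths, counting a modifier as 0 only directly after a known base."""
    total = 0
    prev = None
    for ch in text:
        follows_base = prev is not None and ord(prev) in MODIFIER_BASES
        if is_skin_tone_modifier(ch) and follows_base:
            pass  # part of a modifier sequence: adds no columns
        else:
            total += simple_width(ch)
        prev = ch
    return total
```

Under these assumptions, `context_width("\U0001F3FD")` counts the isolated modifier as 2 columns, while `context_width("\U0001F44D\U0001F3FD")` counts the thumbs-up plus modifier as 2 columns in total. Every character measured now pays for the look-behind and the set lookup, and the base list has to be kept in sync with each Unicode release.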

I think most downstream developers would prefer a faster wcwidth measurement, at the cost of being unable to measure these edge cases. Only for emoji is such a list somewhat feasible (emoji-variation-sequences.txt). I trialed a small implementation of this, but it came at a very high cost in resources and slower measurement for the ~0.1% use case!

I have decided that, for emoji and combining characters whose width is 0 or 2 depending on the character they are conjoined with in sequence, wcwidth expects them to be used only in their most common, combined form; their uncommon standalone form is not supported.

This means that if a developer wishes to display a standalone Fitzpatrick skin tone emoji or a Hangul jungseong (a Korean combining character, see the example test below), they should not rely on the Python wcwidth library to measure its width.

An example test of jungseong, where wcwidth knowingly gets it wrong:

wcwidth/tests/test_core.py

Lines 225 to 252 in 57cfbda

def test_kr_jamo():
    """
    Test basic combining of HANGUL CHOSEONG and JUNGSEONG

    Example and from Raymond Chen's blog post,
    https://devblogs.microsoft.com/oldnewthing/20201009-00/?p=104351
    """
    # This is an example where both characters are "wide" when displayed alone.
    #
    # But JUNGSEONG (vowel) is designed for combination with a CHOSEONG (consonant).
    #
    # This wcwidth library understands their width only when combination,
    # and not by independent display, like other zero-width characters that may
    # only combine with an appropriate preceding character.
    phrase = (
        u"\u1100"  # ᄀ HANGUL CHOSEONG KIYEOK (consonant)
        u"\u1161"  # ᅡ HANGUL JUNGSEONG A (vowel)
    )
    expect_length_each = (2, 0)
    expect_length_phrase = 2

    # exercise,
    length_each = tuple(map(wcwidth.wcwidth, phrase))
    length_phrase = wcwidth.wcswidth(phrase)

    # verify.
    assert length_each == expect_length_each
    assert length_phrase == expect_length_phrase
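
To illustrate why the pair in this test is expected to occupy a single 2-column cell, the following sketch uses only the standard `unicodedata` module (it does not call wcwidth). It shows that NFC normalization composes the two conjoining jamo into one precomposed Hangul syllable whose East Asian Width is wide.

```python
import unicodedata

choseong = "\u1100"   # HANGUL CHOSEONG KIYEOK (consonant)
jungseong = "\u1161"  # HANGUL JUNGSEONG A (vowel)

# NFC normalization composes the conjoining pair into one precomposed syllable.
syllable = unicodedata.normalize("NFC", choseong + jungseong)

print(len(syllable), f"U+{ord(syllable):04X}", unicodedata.east_asian_width(syllable))
# prints: 1 U+AC00 W
```

The composed form is a single wide character, which matches the combined width of 2 the test expects for the sequence. It says nothing about how a terminal renders a jungseong with no preceding consonant, which is the unsupported case.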
