Emoji: Skin tone modifiers used in isolation #133

janlelis · 2024-11-13T11:22:27Z

Hi Jeff,

In the wcwidth specification, you list skin tone modifiers as being zero-width. However, this is not always true: they should be displayed as emoji (2 columns) when used in isolation, that is, when not part of a (known) emoji sequence:

https://www.unicode.org/reports/tr51/#def_basic_emoji_set

Related: #134

jquast · 2024-11-13T15:24:58Z

Yes. There are combining characters that also have this exact same behavior.

This is difficult because it would mean having a definition of, "0 width when in sequence with [... large list ...], width 2 otherwise", and tracking every likely combination.
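
To make that rule concrete, here is a minimal sketch of a context-dependent width for skin tone modifiers only. It is not how wcwidth works. The names `skin_tone_width` and `EXAMPLE_MODIFIER_BASES` are hypothetical, and the base set is a tiny illustrative subset of the Emoji_Modifier_Base characters that Unicode lists in emoji-data.txt.

```python
# Fitzpatrick skin tone modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)

# ASSUMPTION: a tiny illustrative subset of Emoji_Modifier_Base characters.
# A real implementation would need the full (much longer) list.
EXAMPLE_MODIFIER_BASES = {
    0x261D,   # INDEX POINTING UP
    0x1F44B,  # WAVING HAND SIGN
    0x1F44D,  # THUMBS UP SIGN
}


def skin_tone_width(text, index):
    """Width of the skin tone modifier at text[index] (hypothetical helper).

    Returns 0 when the modifier directly follows a known base character,
    and 2 when it stands alone or follows anything else.
    """
    if ord(text[index]) not in SKIN_TONE_MODIFIERS:
        raise ValueError("character at index is not a skin tone modifier")
    if index > 0 and ord(text[index - 1]) in EXAMPLE_MODIFIER_BASES:
        return 0
    return 2
```

Even this reduced version needs a lookup of the preceding character for every modifier, and the same kind of table would be needed for each family of combining characters that behaves this way.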

I think downstream developers would all prefer a faster wcwidth measurement, at the cost of being unable to measure these edge cases. Only for emoji is such a list somewhat feasible (emoji-variation-sequences.txt). I trialed a small implementation of this, but it seemed like a very high resource cost and slower measurement for the ~0.1% use case!

I have decided, for emoji and combining characters that have a width of 0 or 2 depending on the character they are conjoined with in sequence, that wcwidth expects them to be used only in their most common, combined form. Their uncommon standalone form is not supported.

This means that if a developer wishes to display a single Fitzpatrick skin tone emoji or a Hangul jungseong (a Korean combining character, see the example test below), they should not rely on the Python wcwidth library to measure the width.
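
A caller who does need the standalone case could handle it outside the library. The sketch below is one possible workaround, not part of wcwidth: `width_with_leading_fix` and `is_context_dependent` are hypothetical names, `measure` is any caller-supplied string-width function (for example `wcwidth.wcswidth`), and only the simplest case is covered, where the string starts with such a character and so has no preceding base at all. It assumes the measuring function counts that character as 0 and that it displays as 2 columns alone, as described in this thread.

```python
def is_context_dependent(ch):
    """True for characters this sketch treats as 0-or-2 width by context."""
    cp = ord(ch)
    return (
        0x1F3FB <= cp <= 0x1F3FF    # Fitzpatrick skin tone modifiers
        or 0x1160 <= cp <= 0x11A7   # Hangul jungseong (vowel) jamo
    )


def width_with_leading_fix(text, measure):
    """Measure text, adding 2 columns for a leading standalone character.

    `measure` is assumed to count context-dependent characters as 0 and
    to return a negative value when the text cannot be measured.
    """
    width = measure(text)
    if text and width >= 0 and is_context_dependent(text[0]):
        width += 2
    return width
```

A modifier that follows an unrelated character in the middle of a string is not detected here; catching that needs the full base-character lookup discussed above.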

An example test of jungseong, where wcwidth knowingly gets it wrong:

wcwidth/tests/test_core.py

Lines 225 to 252 in 57cfbda

    
    def test_kr_jamo():
        """
        Test basic combining of HANGUL CHOSEONG and JUNGSEONG
        Example and from Raymond Chen's blog post,
        https://devblogs.microsoft.com/oldnewthing/20201009-00/?p=104351
        """
        # This is an example where both characters are "wide" when displayed alone.
        #
        # But JUNGSEONG (vowel) is designed for combination with a CHOSEONG (consonant).
        #
        # This wcwidth library understands their width only when combination,
        # and not by independent display, like other zero-width characters that may
        # only combine with an appropriate preceding character.
        phrase = (
            u"\u1100"  # ᄀ HANGUL CHOSEONG KIYEOK (consonant)
            u"\u1161"  # ᅡ HANGUL JUNGSEONG A (vowel)
        )
        expect_length_each = (2, 0)
        expect_length_phrase = 2
        # exercise,
        length_each = tuple(map(wcwidth.wcwidth, phrase))
        length_phrase = wcwidth.wcswidth(phrase)
        # verify.
        assert length_each == expect_length_each
        assert length_phrase == expect_length_phrase

jquast added bug question needs-research labels Nov 13, 2024
