Should wcwidth have "Treat ambiguous-width as wide" option? #123

Open
keatonLiu opened this issue Mar 30, 2024 · 10 comments

Comments

@keatonLiu

keatonLiu commented Mar 30, 2024

import wcwidth

if __name__ == '__main__':
    print(wcwidth.wcswidth("①你好"))
    print(wcwidth.wcswidth("你好啊"))

This results in:
image
But ① displays 2 characters wide in a monospace font:
image
image

@jquast
Owner

jquast commented Mar 30, 2024

Which terminal emulator are you using in this example?

For iTerm2, this is correct,

image

As well as WezTerm,

image

And also Kitty,

image

@keatonLiu
Author

Maybe wcwidth only focuses on terminal fonts? I'm using PrettyTable, which depends on wcwidth, to generate a table, and I want to display the table text in a browser. I'm using Chrome, and I found that monospace fonts work fine most of the time. But some Unicode characters display with a different width.

@keatonLiu
Author

It would be helpful if I could provide the font family and get a more general result. Is that possible?

@jquast
Owner

jquast commented Mar 30, 2024

wcwidth is primarily focused on terminals; that is, if browsers and terminals disagree, we would rather match terminals. I would expect a JavaScript or browser-based library that is more focused on browser width to exist, but I cannot find one at the moment. Please suggest one if you do.

Browsers are able to communicate directly with the font engine of the operating system, while wcwidth in Python and other languages is not, so we generally take a more naive approach. This is probably why most terminals are also wrong in this case while browsers are not.

In this case, the problem with ① (https://codepoints.net/U+2460) is that it is Ambiguous width (https://unicode.org/reports/tr11/#Ambiguous) and,

They have a “resolved” width of either narrow or wide depending on the context of their use.

In the following code blocks I use the same character, one with english letters on the same line,

①2345
12345

and another of your example with your Mandarin Chinese "hello",

①你好
12345

Although they render differently sized, at least on my browser (Firefox 120.0.1), they have approximately the same width. I will say that monospace fonts do not always align vertically in browsers (note how the number '5' does not align in the first example), while they always do in terminals.

Screenshot of the above,
image
(End screenshot)

It would require more experimentation, but maybe for a page of Chinese locale it would render differently, such as in your original screenshot, I'm not really sure.

In any case, many terminals have an option to display ambiguous-width characters as 2 cells.

I'm not certain, but maybe this option is more frequently used by East Asian language users in terminals?

But it is very problematic -- the entire software stack needs to agree to "treat ambiguous width as wide". For example, here is an "$LD_PRELOAD-able library and a wrapper script" that patches POSIX wcwidth for this option, and references many issues and bugs about it: https://github.com/fumiyas/wcwidth-cjk

The "Terminal Working Group" tried to come to a consensus about this and other issues, https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9#note_406682 -- there was a great deal of discussion but this "Working Group" specifications project has failed to come to any consensus at all on any single issue (the "accepted" folder is empty, 31 open issues)

Maybe this library could also provide such an option, to "treat ambiguous width as wide". I will rewrite this GitHub issue to match that request.
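
To make the request concrete, here is a minimal sketch of what a "treat ambiguous width as wide" switch could look like. It uses only Python's standard `unicodedata` module, not wcwidth's own tables, so it is a rough approximation rather than wcwidth's algorithm. The function name `approx_width` and the parameter `ambiguous_is_wide` are hypothetical and are not part of wcwidth's API; control characters and emoji sequences are ignored for brevity.

```python
import unicodedata


def approx_width(text, ambiguous_is_wide=False):
    """Approximate the number of terminal cells needed to display text.

    Hypothetical helper for illustration only, not wcwidth's API.
    """
    total = 0
    for ch in text:
        # Combining marks attach to the previous character: zero cells.
        if unicodedata.combining(ch):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            # Wide and Fullwidth characters always occupy two cells.
            total += 2
        elif eaw == "A" and ambiguous_is_wide:
            # Ambiguous characters count as two cells only when the caller opts in.
            total += 2
        else:
            total += 1
    return total


if __name__ == "__main__":
    print(approx_width("①你好"))
    print(approx_width("①你好", ambiguous_is_wide=True))
```

The flag only helps if the terminal is configured the same way, which is the "entire software stack needs to agree" problem described above.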

@jquast changed the title from "Some unicode width not correct" to "Should wcwidth have "Treat ambiguous-width as wide" option?" on Mar 30, 2024
@GalaxySnail
Collaborator

It's also rendered with a width of 1 in Windows Terminal.

What's more, it's also rendered with a width of 1 in my web browser (Chromium).

I personally agree that "①" should be East Asian Wide, but unfortunately it is East Asian Ambiguous (and a similar character U+2780 is East Asian Neutral). In my opinion, it may need to be addressed in Unicode, but I'm not sure. Unicode is a bit chaotic. ¯\_(ツ)_/¯
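
The classifications mentioned above can be checked with Python's standard `unicodedata` module. This is only an inspection sketch; the helper name `eaw_classes` is made up for illustration, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def eaw_classes(chars):
    """Map each character to its East_Asian_Width property value."""
    return {ch: unicodedata.east_asian_width(ch) for ch in chars}


if __name__ == "__main__":
    # U+2460 CIRCLED DIGIT ONE, U+2780 DINGBAT CIRCLED SANS-SERIF DIGIT ONE,
    # and U+4F60 (the CJK ideograph from the original example).
    for ch, eaw in eaw_classes("\u2460\u2780\u4f60").items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {eaw}")
```

"A" means Ambiguous, "N" means Neutral, and "W" means Wide.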

@keatonLiu
Author

keatonLiu commented Mar 30, 2024

Thank you for so much work! You are very helpful.
I tested this in Windows Terminal and got the same result.
image

I understand this is because ① is an East Asian Ambiguous character, which is rendered at a different size in different contexts. I agree that there could be a "treat ambiguous width as wide" option, because in most cases it displays at the same size as an East Asian character in my locale.
You can visit this website for an intuitive demo: https://www.zhonghuazidian.com/zi/%E2%91%A0
In my browser, Chrome:
image
Even in Word:
image
I think it will be a wide character if you use a monospace font-family in the browser.

@keatonLiu
Author

Interesting. I'm using Chrome, and it displays differently:
image

@fancidev

fancidev commented Sep 20, 2024

I ran into a similar issue when displaying the Unicode filled square character U+25A0 in the Windows console (command prompt) using prompt_toolkit. Under one font it renders as single width; under another font it renders as double width.

Since Windows doesn’t provide a wcwidth function, it seems quite hard to tell the rendered width reliably. Maybe prompt_toolkit could actually render the text and compute the width from cursor advancement, but that seems a bit too much. So I just replace the Unicode character with an ASCII character that is definitely single width…
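
The ASCII-replacement workaround can be sketched roughly as follows, using the standard `unicodedata` module to find Ambiguous characters. The `FALLBACKS` table and the `ascii_fallback` function are hypothetical names chosen for illustration; the substitutions themselves are arbitrary.

```python
import unicodedata

# Hypothetical substitution table: ambiguous-width character -> ASCII stand-in.
FALLBACKS = {
    "\u25a0": "#",  # BLACK SQUARE
    "\u2460": "1",  # CIRCLED DIGIT ONE
}


def ascii_fallback(text, default="?"):
    """Replace every East Asian Ambiguous character with a one-cell ASCII one."""
    out = []
    for ch in text:
        if unicodedata.east_asian_width(ch) == "A":
            # Use the table entry if there is one, else a generic placeholder.
            out.append(FALLBACKS.get(ch, default))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(ascii_fallback("\u25a0 done"))
```

Many other symbols are also classified as Ambiguous, so an application might prefer to replace only the specific characters it actually emits rather than the whole class.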

@jquast
Owner

jquast commented Sep 20, 2024

Thanks for chiming in @fancidev, I did actually write a tool that does just as you describe, "could render the text and compute the width from cursor advancement", that is definitely possible! https://github.com/jquast/ucs-detect

Maybe there could be an environment variable that ucs-detect could export, which wcwidth could use to more accurately determine any widths that fall outside the specification for a given terminal.
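
For readers unfamiliar with the technique, here is a minimal sketch of measuring width from cursor advancement. It is not how ucs-detect is implemented; it only assumes an interactive POSIX terminal that answers the standard "Device Status Report" query (`ESC [ 6 n`) with a cursor position report (`ESC [ row ; col R`). The function names are made up for this example, and the call will block if the terminal never replies.

```python
import os
import re
import sys
import termios
import tty

# Cursor Position Report sent by the terminal: ESC [ <row> ; <col> R
CPR = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(reply):
    """Return (row, col) parsed from a cursor position report string."""
    match = CPR.search(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(1)), int(match.group(2))


def _query_column(fd):
    """Ask the terminal where the cursor is and return its column."""
    sys.stdout.write("\x1b[6n")  # Device Status Report request
    sys.stdout.flush()
    reply = b""
    while not reply.endswith(b"R"):
        chunk = os.read(fd, 1)
        if not chunk:
            raise EOFError("terminal closed before replying")
        reply += chunk
    return parse_cursor_report(reply.decode("ascii", "replace"))[1]


def measure_width(text):
    """Print text and return how many cells the cursor advanced."""
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # no line buffering, no echo of the terminal's reply
        sys.stdout.write("\r")
        start = _query_column(fd)
        sys.stdout.write(text)
        end = _query_column(fd)
        sys.stdout.write("\r\x1b[K")  # return to column 1 and erase the line
        sys.stdout.flush()
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return end - start
```

Calling `measure_width("\u25a0")` from an interactive session would report what the current terminal actually does with that character. Text long enough to wrap at the right margin would give a misleading result, so the sketch is only suitable for short probes.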

@fancidev

fancidev commented Sep 24, 2024

Thanks for the info. Good to know there is already such a facility!

The demo on the homepage of ucs-detect runs through the characters on screen. I wonder if the width detection can be performed without echo?

If that’s possible, a solution could be to run through the ambiguous characters on application start-up (as well as upon special console events such as a font change) and remember the results. (If the delay is small, the end-user would not feel it.)
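
The "measure once and remember" idea could be sketched like this. The class name `AmbiguousWidthCache` and its methods are hypothetical; the `probe` argument stands for any callable that returns the measured width of a single character, for example one based on cursor advancement. Whether such a probe can run without visible output is the open question above, and this sketch does not answer it.

```python
import unicodedata


class AmbiguousWidthCache:
    """Remember measured widths of East Asian Ambiguous characters."""

    def __init__(self, probe):
        self._probe = probe  # callable: one-character str -> int (cells)
        self._widths = {}

    def warm_up(self, chars):
        """Measure every Ambiguous character in chars, e.g. at start-up."""
        for ch in chars:
            if unicodedata.east_asian_width(ch) == "A":
                self._widths[ch] = self._probe(ch)

    def invalidate(self):
        """Forget all measurements, e.g. after a font change."""
        self._widths.clear()

    def width(self, ch):
        """Return the measured width if known, else a table-based guess."""
        if ch in self._widths:
            return self._widths[ch]
        # Fallback: Wide and Fullwidth take two cells, everything else one.
        return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


if __name__ == "__main__":
    # A stand-in probe that pretends the terminal renders ambiguous as wide.
    cache = AmbiguousWidthCache(probe=lambda ch: 2)
    cache.warm_up("\u25a0\u2460")
    print(cache.width("\u25a0"))
```

Keeping the probe injectable means the same cache can be exercised with a fake probe in tests and a real terminal probe in an application.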
