wcwidth rules for unicode 11 #2568

jerch · 2019-11-15T00:21:38Z

A new attempt to solve the emoji issues. It can already be tested with this PR, but is still in a very early state.

Decided not to go the full unicode provider path, gonna start just with wcwidth table replacements. This way we can still deliver the old and the new versions in core, that can be selected by some option later on. Still the unicode provider approach of earlier attempts might get relevant again, if we go for other unicode features in the future (e.g. bidi, grapheme). Edit: UnicodeService added, V11 comes as addon implementation.

This PR gonna fix most emoji issues on newer systems. There are some newer unicode compose rules, that cannot be covered by simple wcwidth lookups. If we want those, we would have to implement true grapheme support. I already investigated in this direction some months ago with the result that this is not respected by any console program either. Thus not worth the impl headache atm.

TODO:

~~shrink data tables (create merge-diff to old table?)~~ not maintainable in the long run
cleanup

@Tyriar, @mofux Could you test under macOS and Windows if the emojis get the right width? (Still have no clue what Unicode versions these use for a certain OS version)

Closes #1709.

jerch · 2019-11-15T14:01:21Z

Differences to before:

+65KB lookup table at runtime
+10KB source code

Imho the +10KB in source code will hurt not much, since the V10 tables are somewhat repetitive to V6, I could prolly lower that to +5KB with a diff-merge. Not sure if its worth the trouble (much harder to maintain). Ideas?

Interface: CharWidth.ts now exports wcwidth and getStringCellWidth versioned as ...V6 or ...V10. Its the callers responsibility to choose the right version depending on options.unicodeVersion. This is already implemented in InputHandler.

Perfwise I see no difference, the 2nd lookup table creation adds <1 ms during loading time.

src/browser/Linkifier.ts

mofux · 2019-11-15T20:02:20Z

Tried on OSX and regular emojis won't break my prompt anymore 🎉 Combined emojis (👨‍👩‍👦‍👦) don't work, but they still look better than in Terminal.app and iTerm 👍

jerch · 2019-11-15T20:27:56Z

@mofux Thx for testing. Yeah those compound emojis will still behave differently, its not doable with wcwidth calculations alone. This would need grapheme parsing rules + output metrics returned by the font renderer. Both is hard to achieve and cannot be made failsafe across diff. systems (even fonts), well it should not be part of this PR. (or translate this into "dont use compound enoijs in terminal apps, they prolly never gonna work reliable")

Youre right with your concerns regarding addons, have not taken those into account. Guess I gonna revive the provider idea and put it into a service.

Tyriar · 2019-11-17T20:24:57Z

@jerch I forgot, did the other terminals have settings for the unicode version or just try to the best fit one based on popular OS's?

jerch · 2019-11-17T20:27:40Z

No most I looked into have one static wcwidth version, that evtl. gets updated with new releases and OS updates along the way. Only one I remember with configurable unicode settings is iTerm2.

Tyriar · 2019-11-17T20:28:38Z

Youre right with your concerns regarding addons, have not taken those into account. Guess I gonna revive the provider idea and put it into a service.

I think the model with the renderer works pretty well; shipping a (small) standard implementation (v6) and allowing to plug in a different one (v11). Using addons also enables the use of code splitting, while it might not make much of a difference by itself I fear death by 1000 papercuts if we keep including more and more into the core. Any opportunity for code splitting is great imo.

I remember tossing the idea around of having the unicode ones special so v6 isn't loaded at all if we're not using it, that might be over complicating things and we could defer improving upon that at a later date.

…ted_wcwidth

jerch · 2019-12-06T15:34:24Z

Refactored the PR to contain an UnicodeService, which maintains different Unicode versions. By default its linked with V6 (hardcoded), any other version can be added by an addon later on. Created an example addon for V11.

mofux · 2019-12-06T16:08:01Z

@jerch @Tyriar What are your thoughts on making v11 the default, and shipping v6 as an addon? In my eyes making v11 the default does more good than harm. Maybe a change like that should at least be considered for the next major release?

jerch · 2019-12-06T16:13:59Z

@mofux Sounds good to me, I think most common systems are already on v10+. Only systems that will suffer are older server setups, like CentOS 7 or lower, Ubuntu prior 18 LTS, not sure when Debian switched to v9+. We should at least keep V6 around as long as those are still actively maintained (at least a couple more years).

jerch · 2020-01-06T23:35:58Z

Btw, I have this script to generate stuff from Unicode table
https://github.com/fluffos/fluffos/tree/master/src/thirdparty/widecharwidth

@thefallentree Thats pretty awesome, could we refactor this to a general unicode version extractor? I have something similar written JS, but its just a bunch of scripts that still needs handwork to get the final tables extracted. I plan to add this with the next unicode PR for more versions, maybe I can use yours instead?

thefallentree · 2020-01-07T00:18:56Z

Btw, I have this script to generate stuff from Unicode table
https://github.com/fluffos/fluffos/tree/master/src/thirdparty/widecharwidth

@thefallentree Thats pretty awesome, could we refactor this to a general unicode version extractor? I have something similar written JS, but its just a bunch of scripts that still needs handwork to get the final tables extracted. I plan to add this with the next unicode PR for more versions, maybe I can use yours instead?

Let's do it! Better still we could make current python script parse and generate per unicode version of both JS and C headers that way people can freely choose to use whatever version they want to support.

jerch · 2020-01-07T14:49:12Z

Let's do it! Better still we could make current python script parse and generate per unicode version of both JS and C headers that way people can freely choose to use whatever version they want to support.

Yes, ideally we manage to make it working for any or at least the newer versions to create the wcwidth tables, maybe for different language targets. This way it could act as a solid base for any wcwidth needs. Later on it could also deal with grapheme clusters and other important Unicode stuff (once we need / support those).

thefallentree · 2020-01-08T05:13:42Z

@jerch could you also add an option here to treat ambiguous chars as wide? For example Roman Numeral Ⅱ is not rendered in full-width right now

and base on the EAW table, it is classified as ambiguous , almost all tty implementations including mintty, putty has this option to treat ambiguous as double width.

U+2161 | Ⅱ | ROMAN NUMERAL TWO

jerch · 2020-01-08T11:38:49Z

@thefallentree
Ah yes good find. We already had reports regarding this, but it kinda never made it into the code for various reasons, mainly:
The treatment of ambiguous chars as wide is seen as an ancient feature that was commonly used with legacy CJK encodings, but is now somewhat discouraged with Unicode.

I think exposing such a setting would help in circumstances, where the app side's wcwidth implementation still uses the encoding dependent switch. So yes, we should have that, downside though is ~3KB bigger source files for the ambiguous definitions. I think we should not add it with this PR, as it would introduce many more changes across the code (also to the old tables). --> Tracking issue: #2668

thefallentree · 2020-01-17T22:11:10Z

will this be merged in soon?

thefallentree · 2020-01-17T22:24:48Z

I looked at this code, but I couldn't understand how it would solve the char-breaking issue i mentioned previously in #2660

The issue here is that DOM renderer is breaking this 7 codepoint emoji into 4 s each contains an codepoint and and ZWJ, This PR just changed wcwidth, I don't think it will change that

jerch · 2020-01-19T16:01:55Z

@thefallentree Correct, this PR does not solve that issue. As I wrote above it is even worse - it is questionable if we ever can find a proper solution for that part of unicode rules for the terminal. While grapheme support itself is doable, it has to be in line with the what the other side thinks of the output representation. Currently none of the common terminal libs have proper support for that and would simply account those emojis as single consecutive chars. So this has to be fixed on both ends together to keep working.

The compound emojis have another problem (as already noted here #2664 (comment)) - they have multiple valid output representations - depending on the renderer capabilities, which makes the runwidth undetermined for app side - a clear no-go. This needs some specification of a default behavior so appside has something to work with. Again thats a problem of the terminal interface in general being incompatible with what the apple guys specced in the newer unicode versions (they dont care much even for their own system, Terminal.app will also fail badly here).

TL;DR: Dont use those compound emojis in terminals, they will screw up output in any emulator atm.

thefallentree · 2020-01-23T01:41:10Z

@jerch Clearly there is no cookie cutter solution here, I'm saying the only solution is to refactor xtermjs to provide an interface so people can set their version of desired unicode version to support. (not just wcwdith, but also character joining rules) .

That way , xtermjs can be tuned accordingly matching what the application desires. (For my websocket terminal application, I want to use unicode 12.1 and I've tuned everything with it except xtermjs)

jerch · 2020-01-24T21:35:05Z

@thefallentree Yes, I think the unicode provider in this PR is a good starting point to enhance unicode rules later on in a more version aware fashion.

Tyriar

It's marked experimental so I say let's do it 🚀

Can hopefully move VS Code over to test it out this month microsoft/vscode#89940

jerch · 2020-02-03T19:14:36Z

@Tyriar I think the provider stuff is quite solid, not so the new table, it might be subject of several changes for many entries. There is no reasonable source of truth anymore, thus I mostly went with xterm compliance (v349).
Windows with conpty is another sad story, I was not able to get far with it yet. For some reason conpty keeps to return weird things for many emojis (does weird screen handling with wrong back erasing and such), I even started to wonder whether there is some issue in node-pty. Then I started to dig into the ms terminal code, but cannot get it build either (from repo code). This def. needs some more attention, my tests showed very poor emoji support (I think all newer unicode chars are affected).

Tyriar · 2020-02-03T19:25:23Z

@jerch do we need a tracking issue for conpty emoji support?

plastikfan · 2021-04-09T14:10:50Z

Hi, just for my 2 pence worth; I have built some PowerShell functionality that heavily uses emojis and seeing undesirable results in Fluent Terminal which depends on xtermjs as opossed to the correct alignment of emojis seen in Windows Terminal. Here is an example of the same code, running in Fluent Terminal vs Windows Terminal:

Fluent Terminal:

WIndows Terminal:

The terminal in Visual Studio Code also uses xterm.js and the emoji display is similarly mis-aligned there too.

In these examples, you will notice that in Fluent Terminal, the values inside the tables are not correcly aligned within their columns, but in Windows Terminal they are. And this mis-alignment is not just left to right. Take a look at the green dot preceding 'Parameter Sets', you can see that it is slightly truncated at the top in FT but not in WT.

So I'll be looking forward to seeing this get fixed, because other than this slight issue, I love using Fluent Terminal which is a great console app.

Actually, I have a better example to illustrate this issue. Take a look at this output which comes from another command:

Fluent Terminal:
WIndows Terminal:

jerch · 2021-04-09T15:29:54Z

@plastikfan This is related to getting a proper hold of the used unicode version from windows and subsequently loading the right unicode addon into xterm.js. Currently we only have support for v6 and v11, a more versatile approach to generate unicode version addons is needed (see #2668).
My guess is that those emojis are from Unicode v12+, if you want to get involved - we'd need the current Unicode version your windows (terminal) uses to address that.

plastikfan · 2021-04-09T15:37:35Z

Thanks @jerch. To be honest, I wasnt really that aware of different versions of unicode, so your comment is enlightening to me. I just used emojis with impunity and as they were displayed ok in WT I thought nothing more of it. Its only when I started to test in FT that I began to become aware of problems with displaying emojis in different environments. Currently, I am not aware of version of Unicode supported in WT, but I can do some research and post again.

plastikfan · 2021-04-09T16:31:58Z

Its proving to be quite difficult to get definitive info about unicode version support in WIndows Terminal. However, I did find this pull request which does give us a hint. This one concerns Unicode 13, but I don't know if this is in production or not, I assume not as the pull request is still not closed, but is merged. I would assume but dont know how to confirm, that Unicode 12 in WT is actually in production.

plastikfan · 2021-04-09T16:54:34Z

Hi @jerch, I can confirm (from that pull request previously referenced) that Windows Terminal currently supports Unicode 13.

jerch · 2021-04-09T17:02:29Z

Thx for investigating. Will see if we can get a v13 addon rolling. This should fix most single emoji issues. Note that more complex things still wont work, as we currently dont support grapheme clusters (needed for compound emojis).

plastikfan · 2021-04-09T17:06:38Z

Fyi, @DHowett responded with this comment, which will mean more to you than it does to me.

jerch · 2021-04-09T17:15:16Z

Fyi, @DHowett responded with this comment, which will mean more to you than it does to me.

What he points out is clear and in fact the reason, why we have to support multiple runtime changable versions (down to v6 for older systems accessed via ssh and such).

wcwidth rules unicode 10

9c9aa68

jerch added the work-in-progress Do not merge label Nov 15, 2019

jerch self-assigned this Nov 15, 2019

jerch added 2 commits November 15, 2019 13:33

terminal option unicodeVersion

c7ad4aa

cleanup CharWidth.ts

25bbde2

jerch changed the title ~~wcwidth rules for unicode 10~~ wcwidth rules for unicode 11 Nov 15, 2019

change to v11, expose unicodeVersion, fix linkifier

aca6d24

mofux reviewed Nov 15, 2019

View reviewed changes

src/browser/Linkifier.ts Outdated Show resolved Hide resolved

Merge branch 'master' into updated_wcwidth

f81cca6

jerch added 5 commits November 26, 2019 01:59

unicode service stub

435d480

Merge branch 'updated_wcwidth' of github.com:jerch/xterm.js into upda…

a574591

…ted_wcwidth

UnicodeService & provider

3395ccb

Merge branch 'master' into updated_wcwidth

31ccb79

add interfaces and option handling

43303ba

jerch mentioned this pull request Dec 6, 2019

Unicode handling in xterm.js #1709

Closed

jerch added this to the 4.4.0 milestone Dec 6, 2019

unicode11 addon

303abf8

jerch removed the work-in-progress Do not merge label Dec 6, 2019

jerch added 3 commits December 6, 2019 16:44

remove CharWidth.ts

7945739

remove V11 from base repo

0f83d64

fix failing tests

70dabf1

Merge branch 'master' into updated_wcwidth

d4d7446

Merge branch 'master' into updated_wcwidth

3bd0cb9

Tyriar mentioned this pull request Feb 3, 2020

Adopt new unicode version provider in xterm.js microsoft/vscode#89940

Closed

Tyriar approved these changes Feb 3, 2020

View reviewed changes

Tyriar merged commit 8145829 into xtermjs:master Feb 3, 2020

jerch mentioned this pull request Feb 4, 2020

emoji/unicode support mostly broken in windows #2693

Open

danvk mentioned this pull request Mar 29, 2020

Column alignment broken with some emojis danvk/dtsearch#1

Open

plastikfan mentioned this pull request Apr 9, 2021

Unicode characters not properly displayed felixse/FluentTerminal#553

Open

plastikfan mentioned this pull request Apr 16, 2021

Show-ParameterSetInfo: columns are misaligned on Unix-like platforms EliziumNet/Loopz#146

Open

apparebit mentioned this pull request Sep 7, 2023

xterm.js lies about Unicode 11, really uses Unicode 12 #4753

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wcwidth rules for unicode 11 #2568

wcwidth rules for unicode 11 #2568

jerch commented Nov 15, 2019 •

edited

Loading

jerch commented Nov 15, 2019 •

edited

Loading

mofux commented Nov 15, 2019

jerch commented Nov 15, 2019 •

edited

Loading

Tyriar commented Nov 17, 2019

jerch commented Nov 17, 2019

Tyriar commented Nov 17, 2019

jerch commented Dec 6, 2019

mofux commented Dec 6, 2019

jerch commented Dec 6, 2019 •

edited

Loading

jerch commented Jan 6, 2020

thefallentree commented Jan 7, 2020

jerch commented Jan 7, 2020

thefallentree commented Jan 8, 2020

jerch commented Jan 8, 2020 •

edited

Loading

thefallentree commented Jan 17, 2020

thefallentree commented Jan 17, 2020

jerch commented Jan 19, 2020

thefallentree commented Jan 23, 2020

jerch commented Jan 24, 2020

Tyriar left a comment

jerch commented Feb 3, 2020 •

edited

Loading

Tyriar commented Feb 3, 2020

plastikfan commented Apr 9, 2021 •

edited

Loading

jerch commented Apr 9, 2021 •

edited

Loading

plastikfan commented Apr 9, 2021

plastikfan commented Apr 9, 2021

plastikfan commented Apr 9, 2021

jerch commented Apr 9, 2021

plastikfan commented Apr 9, 2021

jerch commented Apr 9, 2021

wcwidth rules for unicode 11 #2568

wcwidth rules for unicode 11 #2568

Conversation

jerch commented Nov 15, 2019 • edited Loading

jerch commented Nov 15, 2019 • edited Loading

mofux commented Nov 15, 2019

jerch commented Nov 15, 2019 • edited Loading

Tyriar commented Nov 17, 2019

jerch commented Nov 17, 2019

Tyriar commented Nov 17, 2019

jerch commented Dec 6, 2019

mofux commented Dec 6, 2019

jerch commented Dec 6, 2019 • edited Loading

jerch commented Jan 6, 2020

thefallentree commented Jan 7, 2020

jerch commented Jan 7, 2020

thefallentree commented Jan 8, 2020

jerch commented Jan 8, 2020 • edited Loading

thefallentree commented Jan 17, 2020

thefallentree commented Jan 17, 2020

jerch commented Jan 19, 2020

thefallentree commented Jan 23, 2020

jerch commented Jan 24, 2020

Tyriar left a comment

Choose a reason for hiding this comment

jerch commented Feb 3, 2020 • edited Loading

Tyriar commented Feb 3, 2020

plastikfan commented Apr 9, 2021 • edited Loading

jerch commented Apr 9, 2021 • edited Loading

plastikfan commented Apr 9, 2021

plastikfan commented Apr 9, 2021

plastikfan commented Apr 9, 2021

jerch commented Apr 9, 2021

plastikfan commented Apr 9, 2021

jerch commented Apr 9, 2021

jerch commented Nov 15, 2019 •

edited

Loading

jerch commented Nov 15, 2019 •

edited

Loading

jerch commented Nov 15, 2019 •

edited

Loading

jerch commented Dec 6, 2019 •

edited

Loading

jerch commented Jan 8, 2020 •

edited

Loading

jerch commented Feb 3, 2020 •

edited

Loading

plastikfan commented Apr 9, 2021 •

edited

Loading

jerch commented Apr 9, 2021 •

edited

Loading