Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wcwidth rules for unicode 11 #2568

Merged
merged 29 commits into from
Feb 3, 2020
Merged

wcwidth rules for unicode 11 #2568

merged 29 commits into from
Feb 3, 2020

Conversation

jerch
Copy link
Member

@jerch jerch commented Nov 15, 2019

A new attempt to solve the emoji issues. It can already be tested with this PR, but is still in a very early state.

Decided not to go the full unicode provider path, gonna start just with wcwidth table replacements. This way we can still deliver the old and the new versions in core, that can be selected by some option later on. Still the unicode provider approach of earlier attempts might get relevant again, if we go for other unicode features in the future (e.g. bidi, grapheme). Edit: UnicodeService added, V11 comes as addon implementation.

This PR gonna fix most emoji issues on newer systems. There are some newer unicode compose rules, that cannot be covered by simple wcwidth lookups. If we want those, we would have to implement true grapheme support. I already investigated in this direction some months ago with the result that this is not respected by any console program either. Thus not worth the impl headache atm.

TODO:

  • shrink data tables (create merge-diff to old table?) not maintainable in the long run
  • cleanup

@Tyriar, @mofux Could you test under macOS and Windows if the emojis get the right width? (Still have no clue what Unicode versions these use for a certain OS version)

Closes #1709.

@jerch jerch added the work-in-progress Do not merge label Nov 15, 2019
@jerch jerch self-assigned this Nov 15, 2019
@jerch
Copy link
Member Author

jerch commented Nov 15, 2019

Differences to before:

  • +65KB lookup table at runtime
  • +10KB source code

Imho the +10KB in source code will hurt not much, since the V10 tables are somewhat repetitive to V6, I could prolly lower that to +5KB with a diff-merge. Not sure if its worth the trouble (much harder to maintain). Ideas?

Interface: CharWidth.ts now exports wcwidth and getStringCellWidth versioned as ...V6 or ...V10. Its the callers responsibility to choose the right version depending on options.unicodeVersion. This is already implemented in InputHandler.

Perfwise I see no difference, the 2nd lookup table creation adds <1 ms during loading time.

@jerch jerch changed the title wcwidth rules for unicode 10 wcwidth rules for unicode 11 Nov 15, 2019
src/browser/Linkifier.ts Outdated Show resolved Hide resolved
@mofux
Copy link
Contributor

mofux commented Nov 15, 2019

Tried on OSX and regular emojis won't break my prompt anymore 🎉 Combined emojis (👨‍👩‍👦‍👦) don't work, but they still look better than in Terminal.app and iTerm 👍

@jerch
Copy link
Member Author

jerch commented Nov 15, 2019

@mofux Thx for testing. Yeah those compound emojis will still behave differently, its not doable with wcwidth calculations alone. This would need grapheme parsing rules + output metrics returned by the font renderer. Both is hard to achieve and cannot be made failsafe across diff. systems (even fonts), well it should not be part of this PR. (or translate this into "dont use compound enoijs in terminal apps, they prolly never gonna work reliable")

Youre right with your concerns regarding addons, have not taken those into account. Guess I gonna revive the provider idea and put it into a service.

@Tyriar
Copy link
Member

Tyriar commented Nov 17, 2019

@jerch I forgot, did the other terminals have settings for the unicode version or just try to the best fit one based on popular OS's?

@jerch
Copy link
Member Author

jerch commented Nov 17, 2019

No most I looked into have one static wcwidth version, that evtl. gets updated with new releases and OS updates along the way. Only one I remember with configurable unicode settings is iTerm2.

@Tyriar
Copy link
Member

Tyriar commented Nov 17, 2019

Youre right with your concerns regarding addons, have not taken those into account. Guess I gonna revive the provider idea and put it into a service.

I think the model with the renderer works pretty well; shipping a (small) standard implementation (v6) and allowing to plug in a different one (v11). Using addons also enables the use of code splitting, while it might not make much of a difference by itself I fear death by 1000 papercuts if we keep including more and more into the core. Any opportunity for code splitting is great imo.

I remember tossing the idea around of having the unicode ones special so v6 isn't loaded at all if we're not using it, that might be over complicating things and we could defer improving upon that at a later date.

@jerch jerch added this to the 4.4.0 milestone Dec 6, 2019
@jerch jerch removed the work-in-progress Do not merge label Dec 6, 2019
@jerch
Copy link
Member Author

jerch commented Dec 6, 2019

Refactored the PR to contain an UnicodeService, which maintains different Unicode versions. By default its linked with V6 (hardcoded), any other version can be added by an addon later on. Created an example addon for V11.

@mofux
Copy link
Contributor

mofux commented Dec 6, 2019

@jerch @Tyriar What are your thoughts on making v11 the default, and shipping v6 as an addon? In my eyes making v11 the default does more good than harm. Maybe a change like that should at least be considered for the next major release?

@jerch
Copy link
Member Author

jerch commented Dec 6, 2019

@mofux Sounds good to me, I think most common systems are already on v10+. Only systems that will suffer are older server setups, like CentOS 7 or lower, Ubuntu prior 18 LTS, not sure when Debian switched to v9+. We should at least keep V6 around as long as those are still actively maintained (at least a couple more years).

@jerch
Copy link
Member Author

jerch commented Jan 6, 2020

Btw, I have this script to generate stuff from Unicode table
https://github.com/fluffos/fluffos/tree/master/src/thirdparty/widecharwidth

@thefallentree Thats pretty awesome, could we refactor this to a general unicode version extractor? I have something similar written JS, but its just a bunch of scripts that still needs handwork to get the final tables extracted. I plan to add this with the next unicode PR for more versions, maybe I can use yours instead?

@thefallentree
Copy link
Contributor

Btw, I have this script to generate stuff from Unicode table
https://github.com/fluffos/fluffos/tree/master/src/thirdparty/widecharwidth

@thefallentree Thats pretty awesome, could we refactor this to a general unicode version extractor? I have something similar written JS, but its just a bunch of scripts that still needs handwork to get the final tables extracted. I plan to add this with the next unicode PR for more versions, maybe I can use yours instead?

Let's do it! Better still we could make current python script parse and generate per unicode version of both JS and C headers that way people can freely choose to use whatever version they want to support.

@jerch
Copy link
Member Author

jerch commented Jan 7, 2020

Let's do it! Better still we could make current python script parse and generate per unicode version of both JS and C headers that way people can freely choose to use whatever version they want to support.

Yes, ideally we manage to make it working for any or at least the newer versions to create the wcwidth tables, maybe for different language targets. This way it could act as a solid base for any wcwidth needs. Later on it could also deal with grapheme clusters and other important Unicode stuff (once we need / support those).

@thefallentree
Copy link
Contributor

@jerch could you also add an option here to treat ambiguous chars as wide? For example Roman Numeral Ⅱ is not rendered in full-width right now

and base on the EAW table, it is classified as ambiguous , almost all tty implementations including mintty, putty has this option to treat ambiguous as double width.

U+2161 | Ⅱ | ROMAN NUMERAL TWO

image

@jerch
Copy link
Member Author

jerch commented Jan 8, 2020

@thefallentree
Ah yes good find. We already had reports regarding this, but it kinda never made it into the code for various reasons, mainly:
The treatment of ambiguous chars as wide is seen as an ancient feature that was commonly used with legacy CJK encodings, but is now somewhat discouraged with Unicode.

I think exposing such a setting would help in circumstances, where the app side's wcwidth implementation still uses the encoding dependent switch. So yes, we should have that, downside though is ~3KB bigger source files for the ambiguous definitions. I think we should not add it with this PR, as it would introduce many more changes across the code (also to the old tables). --> Tracking issue: #2668

@thefallentree
Copy link
Contributor

will this be merged in soon?

@thefallentree
Copy link
Contributor

I looked at this code, but I couldn't understand how it would solve the char-breaking issue i mentioned previously in #2660

The issue here is that DOM renderer is breaking this 7 codepoint emoji into 4 s each contains an codepoint and and ZWJ, This PR just changed wcwidth, I don't think it will change that

@jerch
Copy link
Member Author

jerch commented Jan 19, 2020

@thefallentree Correct, this PR does not solve that issue. As I wrote above it is even worse - it is questionable if we ever can find a proper solution for that part of unicode rules for the terminal. While grapheme support itself is doable, it has to be in line with the what the other side thinks of the output representation. Currently none of the common terminal libs have proper support for that and would simply account those emojis as single consecutive chars. So this has to be fixed on both ends together to keep working.

The compound emojis have another problem (as already noted here #2664 (comment)) - they have multiple valid output representations - depending on the renderer capabilities, which makes the runwidth undetermined for app side - a clear no-go. This needs some specification of a default behavior so appside has something to work with. Again thats a problem of the terminal interface in general being incompatible with what the apple guys specced in the newer unicode versions (they dont care much even for their own system, Terminal.app will also fail badly here).

TL;DR: Dont use those compound emojis in terminals, they will screw up output in any emulator atm.

@thefallentree
Copy link
Contributor

@jerch Clearly there is no cookie cutter solution here, I'm saying the only solution is to refactor xtermjs to provide an interface so people can set their version of desired unicode version to support. (not just wcwdith, but also character joining rules) .

That way , xtermjs can be tuned accordingly matching what the application desires. (For my websocket terminal application, I want to use unicode 12.1 and I've tuned everything with it except xtermjs)

@jerch
Copy link
Member Author

jerch commented Jan 24, 2020

@thefallentree Yes, I think the unicode provider in this PR is a good starting point to enhance unicode rules later on in a more version aware fashion.

Copy link
Member

@Tyriar Tyriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's marked experimental so I say let's do it 🚀

Can hopefully move VS Code over to test it out this month microsoft/vscode#89940

@Tyriar Tyriar merged commit 8145829 into xtermjs:master Feb 3, 2020
@jerch
Copy link
Member Author

jerch commented Feb 3, 2020

@Tyriar I think the provider stuff is quite solid, not so the new table, it might be subject of several changes for many entries. There is no reasonable source of truth anymore, thus I mostly went with xterm compliance (v349).
Windows with conpty is another sad story, I was not able to get far with it yet. For some reason conpty keeps to return weird things for many emojis (does weird screen handling with wrong back erasing and such), I even started to wonder whether there is some issue in node-pty. Then I started to dig into the ms terminal code, but cannot get it build either (from repo code). This def. needs some more attention, my tests showed very poor emoji support (I think all newer unicode chars are affected).

@Tyriar
Copy link
Member

Tyriar commented Feb 3, 2020

@jerch do we need a tracking issue for conpty emoji support?

@plastikfan
Copy link

plastikfan commented Apr 9, 2021

Hi, just for my 2 pence worth; I have built some PowerShell functionality that heavily uses emojis and seeing undesirable results in Fluent Terminal which depends on xtermjs as opossed to the correct alignment of emojis seen in Windows Terminal. Here is an example of the same code, running in Fluent Terminal vs Windows Terminal:

  • Fluent Terminal:

fluent-terminal emojis-bad

  • WIndows Terminal:

windows-terminal emojis-ok

The terminal in Visual Studio Code also uses xterm.js and the emoji display is similarly mis-aligned there too.

In these examples, you will notice that in Fluent Terminal, the values inside the tables are not correcly aligned within their columns, but in Windows Terminal they are. And this mis-alignment is not just left to right. Take a look at the green dot preceding 'Parameter Sets', you can see that it is slightly truncated at the top in FT but not in WT.

So I'll be looking forward to seeing this get fixed, because other than this slight issue, I love using Fluent Terminal which is a great console app.

Actually, I have a better example to illustrate this issue. Take a look at this output which comes from another command:

  • Fluent Terminal:
    fluent-terminal emojis-bad signals

  • WIndows Terminal:
    windows-terminal emojis-ok signals

@jerch
Copy link
Member Author

jerch commented Apr 9, 2021

@plastikfan This is related to getting a proper hold of the used unicode version from windows and subsequently loading the right unicode addon into xterm.js. Currently we only have support for v6 and v11, a more versatile approach to generate unicode version addons is needed (see #2668).
My guess is that those emojis are from Unicode v12+, if you want to get involved - we'd need the current Unicode version your windows (terminal) uses to address that.

@plastikfan
Copy link

Thanks @jerch. To be honest, I wasnt really that aware of different versions of unicode, so your comment is enlightening to me. I just used emojis with impunity and as they were displayed ok in WT I thought nothing more of it. Its only when I started to test in FT that I began to become aware of problems with displaying emojis in different environments. Currently, I am not aware of version of Unicode supported in WT, but I can do some research and post again.

@plastikfan
Copy link

Its proving to be quite difficult to get definitive info about unicode version support in WIndows Terminal. However, I did find this pull request which does give us a hint. This one concerns Unicode 13, but I don't know if this is in production or not, I assume not as the pull request is still not closed, but is merged. I would assume but dont know how to confirm, that Unicode 12 in WT is actually in production.

@plastikfan
Copy link

Hi @jerch, I can confirm (from that pull request previously referenced) that Windows Terminal currently supports Unicode 13.

@jerch
Copy link
Member Author

jerch commented Apr 9, 2021

Thx for investigating. Will see if we can get a v13 addon rolling. This should fix most single emoji issues. Note that more complex things still wont work, as we currently dont support grapheme clusters (needed for compound emojis).

@plastikfan
Copy link

Fyi, @DHowett responded with this comment, which will mean more to you than it does to me.

@jerch
Copy link
Member Author

jerch commented Apr 9, 2021

Fyi, @DHowett responded with this comment, which will mean more to you than it does to me.

What he points out is clear and in fact the reason, why we have to support multiple runtime changable versions (down to v6 for older systems accessed via ssh and such).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode handling in xterm.js
5 participants