-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wcwidth rules for unicode 11 #2568
Conversation
Differences to before:
Imho the +10KB in source code will hurt not much, since the V10 tables are somewhat repetitive to V6, I could prolly lower that to +5KB with a diff-merge. Not sure if its worth the trouble (much harder to maintain). Ideas? Interface: Perfwise I see no difference, the 2nd lookup table creation adds <1 ms during loading time. |
Tried on OSX and regular emojis won't break my prompt anymore 🎉 Combined emojis (👨👩👦👦) don't work, but they still look better than in |
@mofux Thx for testing. Yeah those compound emojis will still behave differently, its not doable with wcwidth calculations alone. This would need grapheme parsing rules + output metrics returned by the font renderer. Both is hard to achieve and cannot be made failsafe across diff. systems (even fonts), well it should not be part of this PR. (or translate this into "dont use compound enoijs in terminal apps, they prolly never gonna work reliable") Youre right with your concerns regarding addons, have not taken those into account. Guess I gonna revive the provider idea and put it into a service. |
@jerch I forgot, did the other terminals have settings for the unicode version or just try to the best fit one based on popular OS's? |
No most I looked into have one static wcwidth version, that evtl. gets updated with new releases and OS updates along the way. Only one I remember with configurable unicode settings is iTerm2. |
I think the model with the renderer works pretty well; shipping a (small) standard implementation (v6) and allowing to plug in a different one (v11). Using addons also enables the use of code splitting, while it might not make much of a difference by itself I fear death by 1000 papercuts if we keep including more and more into the core. Any opportunity for code splitting is great imo. I remember tossing the idea around of having the unicode ones special so v6 isn't loaded at all if we're not using it, that might be over complicating things and we could defer improving upon that at a later date. |
Refactored the PR to contain an UnicodeService, which maintains different Unicode versions. By default its linked with V6 (hardcoded), any other version can be added by an addon later on. Created an example addon for V11. |
@mofux Sounds good to me, I think most common systems are already on v10+. Only systems that will suffer are older server setups, like CentOS 7 or lower, Ubuntu prior 18 LTS, not sure when Debian switched to v9+. We should at least keep V6 around as long as those are still actively maintained (at least a couple more years). |
@thefallentree Thats pretty awesome, could we refactor this to a general unicode version extractor? I have something similar written JS, but its just a bunch of scripts that still needs handwork to get the final tables extracted. I plan to add this with the next unicode PR for more versions, maybe I can use yours instead? |
Let's do it! Better still we could make current python script parse and generate per unicode version of both JS and C headers that way people can freely choose to use whatever version they want to support. |
Yes, ideally we manage to make it working for any or at least the newer versions to create the wcwidth tables, maybe for different language targets. This way it could act as a solid base for any wcwidth needs. Later on it could also deal with grapheme clusters and other important Unicode stuff (once we need / support those). |
@jerch could you also add an option here to treat ambiguous chars as wide? For example Roman Numeral Ⅱ is not rendered in full-width right now and base on the EAW table, it is classified as ambiguous , almost all tty implementations including mintty, putty has this option to treat ambiguous as double width. U+2161 | Ⅱ | ROMAN NUMERAL TWO |
@thefallentree I think exposing such a setting would help in circumstances, where the app side's |
will this be merged in soon? |
I looked at this code, but I couldn't understand how it would solve the char-breaking issue i mentioned previously in #2660 The issue here is that DOM renderer is breaking this 7 codepoint emoji into 4 s each contains an codepoint and and ZWJ, This PR just changed wcwidth, I don't think it will change that |
@thefallentree Correct, this PR does not solve that issue. As I wrote above it is even worse - it is questionable if we ever can find a proper solution for that part of unicode rules for the terminal. While grapheme support itself is doable, it has to be in line with the what the other side thinks of the output representation. Currently none of the common terminal libs have proper support for that and would simply account those emojis as single consecutive chars. So this has to be fixed on both ends together to keep working. The compound emojis have another problem (as already noted here #2664 (comment)) - they have multiple valid output representations - depending on the renderer capabilities, which makes the runwidth undetermined for app side - a clear no-go. This needs some specification of a default behavior so appside has something to work with. Again thats a problem of the terminal interface in general being incompatible with what the apple guys specced in the newer unicode versions (they dont care much even for their own system, Terminal.app will also fail badly here). TL;DR: Dont use those compound emojis in terminals, they will screw up output in any emulator atm. |
@jerch Clearly there is no cookie cutter solution here, I'm saying the only solution is to refactor xtermjs to provide an interface so people can set their version of desired unicode version to support. (not just wcwdith, but also character joining rules) . That way , xtermjs can be tuned accordingly matching what the application desires. (For my websocket terminal application, I want to use unicode 12.1 and I've tuned everything with it except xtermjs) |
@thefallentree Yes, I think the unicode provider in this PR is a good starting point to enhance unicode rules later on in a more version aware fashion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's marked experimental so I say let's do it 🚀
Can hopefully move VS Code over to test it out this month microsoft/vscode#89940
@Tyriar I think the provider stuff is quite solid, not so the new table, it might be subject of several changes for many entries. There is no reasonable source of truth anymore, thus I mostly went with xterm compliance (v349). |
@jerch do we need a tracking issue for conpty emoji support? |
Hi, just for my 2 pence worth; I have built some PowerShell functionality that heavily uses emojis and seeing undesirable results in Fluent Terminal which depends on xtermjs as opossed to the correct alignment of emojis seen in Windows Terminal. Here is an example of the same code, running in Fluent Terminal vs Windows Terminal:
The terminal in Visual Studio Code also uses xterm.js and the emoji display is similarly mis-aligned there too. In these examples, you will notice that in Fluent Terminal, the values inside the tables are not correcly aligned within their columns, but in Windows Terminal they are. And this mis-alignment is not just left to right. Take a look at the green dot preceding 'Parameter Sets', you can see that it is slightly truncated at the top in FT but not in WT. So I'll be looking forward to seeing this get fixed, because other than this slight issue, I love using Fluent Terminal which is a great console app. Actually, I have a better example to illustrate this issue. Take a look at this output which comes from another command: |
@plastikfan This is related to getting a proper hold of the used unicode version from windows and subsequently loading the right unicode addon into xterm.js. Currently we only have support for v6 and v11, a more versatile approach to generate unicode version addons is needed (see #2668). |
Thanks @jerch. To be honest, I wasnt really that aware of different versions of unicode, so your comment is enlightening to me. I just used emojis with impunity and as they were displayed ok in WT I thought nothing more of it. Its only when I started to test in FT that I began to become aware of problems with displaying emojis in different environments. Currently, I am not aware of version of Unicode supported in WT, but I can do some research and post again. |
Its proving to be quite difficult to get definitive info about unicode version support in WIndows Terminal. However, I did find this pull request which does give us a hint. This one concerns Unicode 13, but I don't know if this is in production or not, I assume not as the pull request is still not closed, but is merged. I would assume but dont know how to confirm, that Unicode 12 in WT is actually in production. |
Hi @jerch, I can confirm (from that pull request previously referenced) that Windows Terminal currently supports Unicode 13. |
Thx for investigating. Will see if we can get a v13 addon rolling. This should fix most single emoji issues. Note that more complex things still wont work, as we currently dont support grapheme clusters (needed for compound emojis). |
Fyi, @DHowett responded with this comment, which will mean more to you than it does to me. |
What he points out is clear and in fact the reason, why we have to support multiple runtime changable versions (down to v6 for older systems accessed via ssh and such). |
A new attempt to solve the emoji issues. It can already be tested with this PR, but is still in a very early state.
Decided not to go the full unicode provider path, gonna start just with wcwidth table replacements. This way we can still deliver the old and the new versions in core, that can be selected by some option later on. Still the unicode provider approach of earlier attempts might get relevant again, if we go for other unicode features in the future (e.g. bidi, grapheme).Edit: UnicodeService added, V11 comes as addon implementation.This PR gonna fix most emoji issues on newer systems. There are some newer unicode compose rules, that cannot be covered by simple wcwidth lookups. If we want those, we would have to implement true grapheme support. I already investigated in this direction some months ago with the result that this is not respected by any console program either. Thus not worth the impl headache atm.
TODO:
shrink data tables (create merge-diff to old table?)not maintainable in the long run@Tyriar, @mofux Could you test under macOS and Windows if the emojis get the right width? (Still have no clue what Unicode versions these use for a certain OS version)
Closes #1709.