Spurious spaces appear when printing some character from Unicode Private Use Area #15086

Closed
romkatv opened this issue Apr 3, 2023 · 12 comments
Labels
Issue-Bug: It either shouldn't be doing this or needs an investigation.
Resolution-Duplicate: There's another issue on the tracker that's pretty much the same thing.

Comments

@romkatv

romkatv commented Apr 3, 2023

Windows Terminal version

1.16.10261.0

Windows build number

10.0.19045.0

Other Software

WSL

Steps to reproduce

  1. Open bash or zsh in WSL in Windows Terminal.
  2. Run this command: printf '\UF0737\033[41mx\033[0m\n'

Expected Behavior

The output of the command should occupy two columns. The content of the first column is unspecified (it depends on your font). The second column should contain x.

image

Actual Behavior

The output occupies 3 columns: there is an extra space in the middle.

image

It may appear that the space is a part of the first character. This, however, is not the case, as can be demonstrated by running printf '\UF0737x\033[41my\033[0m\n'.

image

Not all characters from Unicode Private Use Area exhibit this issue. For example, printf '\UE617\033[41mx\033[0m\n' works as intended.

@romkatv added the labels Issue-Bug (it either shouldn't be doing this or needs an investigation) and Needs-Triage (it's a new issue that the core contributor team needs to triage at the next triage meeting) on Apr 3, 2023
@romkatv
Author

romkatv commented Apr 3, 2023

As I mentioned above, font doesn't matter. To avoid confusion, here's a screenshot of all commands with Consolas:

image

The output of the last command is as expected. The output of the first two commands is incorrect (the space in the middle should not be there).

@romkatv
Author

romkatv commented Apr 3, 2023

conhost.exe also suffers from this issue, but in a different way.

image

The output of the second command is different from Windows Terminal but also incorrect.

@lhecker
Member

lhecker commented Apr 3, 2023

This is a well-known issue that is very, very difficult to resolve, because it requires undoing roughly two decades of code built on UCS-2 assumptions. In other words, this happens because your code points are encoded as surrogate pairs, and this code base assumes that each UTF-16 character is at least 1 column wide. A surrogate pair thus cannot be narrower than 2 columns. I am actively working on this issue, however. It's a duplicate of #3546.
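To make the surrogate-pair point concrete, here is a minimal Python sketch (not code from Windows Terminal; the helper name `utf16_code_units` is made up for illustration) that shows how the two code points from the report are encoded in UTF-16:

```python
def utf16_code_units(ch: str) -> list[int]:
    """Return the UTF-16 code units that encode a single code point."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for cp in (0xF0737, 0xE617):
    units = utf16_code_units(chr(cp))
    kind = "surrogate pair" if len(units) == 2 else "single code unit"
    print(f"U+{cp:04X}:", " ".join(f"0x{u:04X}" for u in units), f"({kind})")
```

U+F0737 lies outside the Basic Multilingual Plane and therefore needs two code units (0xDB81 0xDF37), while U+E617 fits in one. That matches the report that only some Private Use Area characters are affected.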

@romkatv
Author

romkatv commented Apr 3, 2023

Thanks for the link. This explains the output of printf '\UF0737x\033[41my\033[0m\n' in conhost.exe. However, the output in Windows Terminal is different, which suggests that it's doing something special. Could you give a hint that would explain the output of this command in Windows Terminal?

@237dmitry

This depends on font and perhaps on Atlas Engine (enabled or not):

Screenshot 2023-04-03 204154

@lhecker
Member

lhecker commented Apr 3, 2023

To be honest, I'm not 100% sure where the different behavior is coming from, and I don't think it's easy to determine. Your Windows 10 version uses a much, much older version of the text processing code than Windows Terminal 1.16, so there's a huge number of places that might be responsible for this.

I've just tested your repro on Windows Terminal Preview (1.17) by the way and it appears it doesn't reproduce anymore:
image

It doesn't matter whether I have AtlasEngine enabled or not. I'm pretty sure it was fixed by PR #14640, because it closes a suspiciously similar issue: #6162.

Since #6162 is so similar I'll close this issue as a duplicate. /dup #6162

@microsoft-github-policy-service
Contributor

Hi! We've identified this issue as a duplicate of another one that already exists on this Issue Tracker. This specific instance is being closed in favor of tracking the concern over on the referenced thread. Thanks for your report!

@microsoft-github-policy-service (bot) added the label Resolution-Duplicate (there's another issue on the tracker that's pretty much the same thing) and removed the label Needs-Triage (it's a new issue that the core contributor team needs to triage at the next triage meeting) on Apr 3, 2023
@lhecker
Member

lhecker commented Apr 3, 2023

BTW, I should add that you'll find many more similar issues around our Unicode support, because what I said previously unfortunately still applies. It's one of my top priorities to address this. Still, if you find any other Unicode issues, please do feel free to file them!

@romkatv
Author

romkatv commented Apr 3, 2023

> This depends on font and perhaps on Atlas Engine

As I mentioned above, this does not depend on font. I didn't mention Atlas Engine but the answer is the same: it does not depend on it.

> I've just tested your repro on Windows Terminal Preview (1.17) by the way and it appears it doesn't reproduce anymore

That's great to hear, and it makes a lot more sense than "this code base assumes that each UTF-16 character is at least 1 column wide", which contradicted my observations.

@DHowett
Member

DHowett commented Apr 3, 2023

"this code base assumes that each UTF-16 character is at least 1 column wide"

You know, this is pretty close to the truth today.

Before Windows Terminal 1.17, the text buffer assumed that each UTF-16 code unit¹ was at least one column wide.

From 1.17 onward, the text buffer assumes that each Unicode code point is at least one column wide. That is, we don't support zero-width characters or grapheme clusters composed of multiple code points.

¹ This is, of course, where "surrogate pairs require at least two columns" comes from. 🙂
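The two lower bounds described above can be modelled in a few lines of Python. This is only a sketch of the stated assumptions, not the text buffer's actual logic, and both function names are hypothetical:

```python
def min_columns_before_1_17(text: str) -> int:
    """Lower bound if every UTF-16 code unit takes at least one column."""
    return len(text.encode("utf-16-le")) // 2  # two bytes per code unit


def min_columns_from_1_17(text: str) -> int:
    """Lower bound if every code point takes at least one column."""
    return len(text)  # a Python str is a sequence of code points


samples = {
    "U+F0737 (surrogate pair)": "\U000F0737",
    "U+E617 (BMP private use)": "\uE617",
    "e + combining acute accent": "e\u0301",
}
for label, text in samples.items():
    print(label, min_columns_before_1_17(text), min_columns_from_1_17(text))
```

Under this model U+F0737 needs at least two columns before 1.17 and one column afterwards, while the combining sequence needs two columns in both cases because zero-width characters are not supported.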

@romkatv
Author

romkatv commented Apr 5, 2023

This doesn't sound like the full story. Here's what I'm seeing in Windows Terminal 1.16.10261.0.

image

As you can see, U+F0737 takes just one column.

Anyway, I'm glad that this issue is fixed in an upcoming version. I'll eagerly wait until my PC picks it up.

@DHowett
Member

DHowett commented Apr 5, 2023

Good observation!

Now, for the real secret. The rendering engine in 1.16 hasn't been informed about which columns to put which characters in, so it renders all text of the same color as a single run, which gets compressed down to the combined advance widths of the glyphs in that run.

If you add another color, it suddenly snaps that new run of text to the correct position:

image
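As a rough illustration of that explanation, the following Python sketch models the behavior under loudly stated assumptions: the buffer reserves two columns for the surrogate pair, the font reports an advance of one column for every glyph, and a new run starts whenever the color changes. `Cell` and `layout` are hypothetical names, not part of the real renderer.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    glyph: str    # label for the glyph being drawn
    column: int   # column the text buffer assigned to the glyph
    advance: int  # advance width reported by the font, in columns (assumed)
    color: str    # color attribute; a change of color starts a new run


def layout(cells: list[Cell]) -> list[tuple[str, int]]:
    """Return (glyph, x) pairs. Each run starts at the buffer column of its
    first cell; glyphs inside a run are packed by advance width."""
    placed = []
    x = 0
    prev_color = None
    for cell in cells:
        if cell.color != prev_color:
            x = cell.column  # a new run snaps to the buffer's column
            prev_color = cell.color
        placed.append((cell.glyph, x))
        x += cell.advance
    return placed


# Model of: printf '\UF0737x\033[41my\033[0m\n'
cells = [
    Cell("U+F0737", column=0, advance=1, color="default"),
    Cell("x", column=2, advance=1, color="default"),
    Cell("y", column=3, advance=1, color="red"),
]
print(layout(cells))
```

In this model the private-use glyph and x are packed into columns 0 and 1, while y snaps to column 3, which leaves column 2 empty. That would account for the stray space seen in 1.16.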

This results in a couple of fun things:

  - An emoji composed of a number of joiners takes up 5, 7, or 9 columns:

    image

  - A line that contains mis-measured characters wraps at the wrong width:

    image

    (This has another bug in it, from some 100-code-unit buffer that we have also gotten rid of recently; plus, I realize that I broke the \U escape.)
