
Pasting Unicode characters in Cygwin Bash loses the first element of the surrogate pair #8302

Closed
Mariusz-W opened this issue Nov 17, 2020 · 12 comments
Labels
Needs-Attention The core contributors need to come back around and look at this ASAP. Needs-Tag-Fix Doesn't match tag requirements Resolution-External For issues that are outside this codebase

Comments

@Mariusz-W

GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
LC_CTYPE=en_US.UTF-8

Windows Terminal
Version: 1.4.3141.0

Attempting to paste, with the right mouse button or with Ctrl-Shift-V, a Unicode character from the Unicode Supplemental Planes into the Cygwin bash shell has the following effect.

Instead of pasting the character, it pastes the 2nd element of the surrogate pair representing that character in the UTF-16BE encoding.

Example: attempting to paste 🀄 it pastes instead �

Character 🀄 has the Unicode code point: 1F004
UTF-16BE encodes 🀄 as the surrogate pair: D83C DC04
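Editor's note: the arithmetic behind that surrogate pair can be checked with a short sketch. The helper name `to_surrogate_pair` is hypothetical and does not come from the thread; it only applies the standard UTF-16 splitting rule for code points above U+FFFF.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into its UTF-16 surrogate halves."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value to distribute
    high = 0xD800 + (offset >> 10)         # top 10 bits -> first (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> second (low) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F004)
print(f"{high:04X} {low:04X}")  # prints "D83C DC04"

# Cross-check against Python's own UTF-16 encoder.
assert "\U0001F004".encode("utf-16-be").hex() == "d83cdc04"
```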

I see the 2nd surrogate element directly, when I paste the character within Cygwin ‘vim’.

I see the 2nd surrogate element re-encoded as the corresponding three-byte UTF-8 sequence when I paste the character into the Cygwin bash command line. In the above example, that happens to be the sequence EF BF BD.
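Editor's note: EF BF BD is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER), which is what a decoder typically emits when it meets an unpaired surrogate, since a lone surrogate has no valid UTF-8 form. The sketch below reproduces those bytes with Python's codecs. It only illustrates the encoding rules and is an assumption about the mechanism, not a trace of what Cygwin actually does.

```python
# The code unit DC04 on its own, in UTF-16LE byte order.
lone_low = b"\x04\xdc"

# A strict decoder rejects an unpaired low surrogate; with errors="replace"
# Python substitutes U+FFFD for it.
decoded = lone_low.decode("utf-16-le", errors="replace")
replacement_bytes = decoded.encode("utf-8")
print(" ".join(f"{b:02X}" for b in replacement_bytes))  # prints "EF BF BD"

# For comparison: forcing the surrogate through the UTF-8 bit layout
# ("surrogatepass") gives ED B0 84, which is not valid UTF-8.
forced_bytes = "\udc04".encode("utf-8", errors="surrogatepass")
print(" ".join(f"{b:02X}" for b in forced_bytes))
```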

This suggests that the problem is caused by WindowsTerminal not passing the 1st element of the surrogate pair to Cygwin Bash along with the 2nd element. If both elements were passed, the problem would likely go away.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Nov 17, 2020
@Mariusz-W
Author

Post scriptum. The GitHub engine substituted the surrogate 2-byte sequence present in the text of my bug report with the � character (which is the Unicode REPLACEMENT CHARACTER, code point FFFD).

@DHowett
Member

DHowett commented Nov 17, 2020

Which version of the Cygwin runtime are you using?

@DHowett DHowett added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Nov 17, 2020
@Mariusz-W
Author

Cygwin 3.1.7(0.340/5/3)

@ghost ghost added Needs-Attention The core contributors need to come back around and look at this ASAP. and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Nov 18, 2020
@DHowett
Member

DHowett commented Nov 20, 2020

So, this is really baffling to me. It works in PowerShell Core, ConEchoKey, and WSL.

When I look at ConEchoKey's output, I get the right data:

Down: 0 Repeat: 1 KeyCode: 0x12 ScanCode: 0x38 Char: ?  (0xd83c) KeyState: 0x0
Down: 0 Repeat: 1 KeyCode: 0x12 ScanCode: 0x38 Char: ?  (0xdc04) KeyState: 0x0

It's coming in as the two halves of the surrogate pair, even when ConEchoKey reads one unit at a time. That's totally what I'd expect.
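Editor's note: a client that reads one UTF-16 unit at a time has to hold on to a high surrogate until the matching low surrogate arrives. The sketch below shows that buffering in isolation. `combine_units` is a hypothetical name, the input is a plain list of integers rather than real console key events, and it is not the code used by ConEchoKey or Cygwin.

```python
def combine_units(units):
    """Yield code points from UTF-16 code units that arrive one at a time."""
    pending_high = None
    for unit in units:
        if 0xD800 <= unit <= 0xDBFF:
            if pending_high is not None:
                yield 0xFFFD  # earlier high surrogate was never completed
            pending_high = unit  # wait for the second half
        elif 0xDC00 <= unit <= 0xDFFF:
            if pending_high is None:
                yield 0xFFFD  # low surrogate with no first half
            else:
                yield 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                pending_high = None
        else:
            if pending_high is not None:
                yield 0xFFFD
                pending_high = None
            yield unit  # ordinary Basic Plane character
    if pending_high is not None:
        yield 0xFFFD  # stream ended in the middle of a pair


# The two Char values from the ConEchoKey output above.
print([hex(cp) for cp in combine_units([0xD83C, 0xDC04])])  # ['0x1f004']
```

If the first half is dropped, the same function yields only U+FFFD, which matches the symptom described in the report.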

I'm going to chalk this one up to application error. Cygwin has, perhaps, never had to support somebody inputting high unicode?

@DHowett DHowett closed this as completed Nov 20, 2020
@DHowett DHowett added Resolution-External For issues that are outside this codebase and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Nov 20, 2020
@Mariusz-W
Author

Mariusz-W commented Nov 20, 2020

This is not Cygwin’s Bash error. In Cygwin’s default terminal application, mintty, everything works perfectly. I constantly use and input Unicode characters from higher planes, and there are no problems with copying and pasting those characters. Cygwin Bash has full support for all the Unicode planes. I had to return to using mintty precisely because of the issue I reported with pasting in WindowsTerminal. What I like about WindowsTerminal is that it can substitute Unicode characters missing from the default font with glyphs from proportional fonts, while other terminal programs are limited to strictly monospaced fonts. I would love to learn more about the mechanism WindowsTerminal employs to achieve that. It is really very useful.

Returning to Pasting: there are also problems with pasting Unicode characters from higher planes into WindowsTerminal CMD shell: the pasted character is displayed wrongly as a surrogate pair. Pasting Unicode characters from the Basic Plane works in WindowsTerminal CMD shell as expected.

In WindowsTerminal Powershell pasting is broken even more, and also for Unicode characters from the Basic Plane, as the following example shows:

PS C:\WINDOWS\System32> [char]0xf900

When I now copy 豈 and paste it back at the command prompt, WindowsTerminal displays it as a question mark:

PS C:\WINDOWS\System32> ?

If I hit «Enter» to check how PowerShell interprets what I pasted, I see that PowerShell sees it as ‘豈’, but if I copy the character displayed as ‘?’ I get simply the question mark.

@Mariusz-W
Author

Mariusz-W commented Nov 20, 2020

“It works in PowerShell Core, ConEchoKey, and WSL.

When I look at ConEchoKey's output, I get the right data:

Down: 0 Repeat: 1 KeyCode: 0x12 ScanCode: 0x38 Char: ? (0xd83c) KeyState: 0x0
Down: 0 Repeat: 1 KeyCode: 0x12 ScanCode: 0x38 Char: ? (0xdc04) KeyState: 0x0”

Pasting in WindowsTerminal PowerShell has its own problems which are even more critical, as I described them in my previous message. Powershell displays Unicode Basic Plane characters and surrogates as question marks, and this is not just the visual representation: if you try to copy what you paste, you copy two question mark characters.

@DHowett
Member

DHowett commented Nov 20, 2020

Alright, so I got this under a debugger. The client (while I am attached to the console host) is using cygwin 3.1's runtime.

Here's what the events that we return to Cygwin look like:

0:001> dx (OpenConsole!KeyEvent*)readEvents[5]->_Mypair->_Myval2,!
(OpenConsole!KeyEvent*)readEvents[5]->_Mypair->_Myval2,!                 : 0x2679624e6e0 [Type: KeyEvent *]
    [+0x008] _keyDown         : false [Type: bool]
    [+0x00a] _repeatCount     : 0x1 [Type: unsigned short]
    [+0x00c] _virtualKeyCode  : 0x12 [Type: unsigned short]
    [+0x00e] _virtualScanCode : 0x38 [Type: unsigned short]
    [+0x010] _charData        : -10180 [Type: wchar_t]
    [+0x014] _activeModifierKeys : None | Alphanumeric (0x0) [Type: KeyEvent::Modifiers]
0:001> dx (OpenConsole!KeyEvent*)readEvents[11]->_Mypair->_Myval2,!
(OpenConsole!KeyEvent*)readEvents[11]->_Mypair->_Myval2,!                 : 0x2679624e4e0 [Type: KeyEvent *]
    [+0x008] _keyDown         : false [Type: bool]
    [+0x00a] _repeatCount     : 0x1 [Type: unsigned short]
    [+0x00c] _virtualKeyCode  : 0x12 [Type: unsigned short]
    [+0x00e] _virtualScanCode : 0x38 [Type: unsigned short]
    [+0x010] _charData        : -9212 [Type: wchar_t]
    [+0x014] _activeModifierKeys : None | Alphanumeric (0x0) [Type: KeyEvent::Modifiers]

_charData from events 5 and 11 (you have to ignore the middle ones -- modifiers being pressed and released) is -10180 and -9212 which come out in 2's complement as d83c and dc04.
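Editor's note: the debugger prints `_charData` as a signed 16-bit integer, so masking with 0xFFFF recovers the unsigned code unit. The sketch below repeats that conversion for the two values quoted above and decodes the result. The helper name `as_utf16_unit` is hypothetical.

```python
import struct


def as_utf16_unit(signed_value: int) -> int:
    """Reinterpret a signed 16-bit value as an unsigned UTF-16 code unit."""
    return signed_value & 0xFFFF


units = [as_utf16_unit(v) for v in (-10180, -9212)]
print([f"{u:04x}" for u in units])  # ['d83c', 'dc04']

# Packing both units as little-endian 16-bit words and decoding them as
# UTF-16 gives back the single character U+1F004.
character = struct.pack("<2H", *units).decode("utf-16-le")
print(f"U+{ord(character):X}")  # U+1F004
```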

As far as we can tell, the events are being generated into Cygwin's address space properly with both halves of the surrogate pair.

Unfortunately, mintty is not properly representative of Cygwin's console interoperability. It has a backchannel that it uses directly with mintty, and it doesn't go through the console APIs.

CMD is tracked in #7777, and PowerShell is a wildcard to me. It works here, but I found a report on PSReadline (which handles powershell's input) that some folks have trouble with it: PowerShell/PSReadLine#1329

With the debugging done, and the input buffer getting all the way into Cygwin . . . I'm ready to call this an application issue. 😄

@Mariusz-W
Author

First, thank you for continuing to investigate the pasting issues. I am not sure I understand your remark about mintty. It has been embraced both by the maintainers and by the users as the default terminal for Cygwin. It is Cygwin’s “mainstream”, so to speak, not some hack by an external enthusiast. I don’t know anybody who would use the old Cygwin console for any work requiring Unicode. The pasting issue is clearly in the area of how the terminal application interacts with Cygwin or Cygwin Bash. If the mintty’s developer faced the same issue, he clearly knew how to solve it. Assuming you are in charge of maintaining WindowsTerminal, it would be a great idea to contact the maintainers of Cygwin to find out from them how to fix the communication between Windows Terminal and Cygwin. WindowsTerminal will be quickly adopted as a great alternative to mintty by a lot of people who have been using Cygwin for many years, provided all the basic issues are resolved. At the moment WindowsTerminal has some advantages over mintty, like the ability to substitute missing characters from fonts that are not monospaced (where can I learn more about how that is achieved?).

I will also be grateful for fixing pasting into PowerShell and CMD. At the moment, editing any text files that contain Unicode characters inside PowerShell is simply impossible. I have such characters in nearly every file I write or edit. With CMD the problem is less severe: the surrogate pairs are not properly displayed, but they are nevertheless there and they can be properly copied.

A separate but closely related issue: make the Vim people aware that Vim on Windows is now Vim in WindowsTerminal. The current Vim Windows binaries are completely broken in WindowsTerminal if I paste any Unicode character, and I don’t know where the problem lies.

@ericrbg

ericrbg commented Nov 21, 2020

@Mariusz-W I've had the same issue with Vim, even if I used the surrogate pair outside the vim session. It completely borks my terminal to open any files to look at them :( All I want is a snake for my virtualenv prompt lol

@DHowett
Member

DHowett commented Nov 23, 2020

So, to address a couple of your points... it's complicated.

I've got good news, though. Cygwin on Windows Terminal and Vim on Windows Terminal are Cygwin and Vim on the traditional Windows console. We have had support for xterm-compatible control sequences for five years now. It's been backwards compatible for that whole time, too. Five years is a long time for the application and framework authors to support these things!

Windows Terminal builds on that support, but at its heart it is still the Windows console code. That code (in this repository, it's in src/host, src/server, etc.) actually goes right back into Windows! Everything Terminal supports today, the traditional Windows console will support "tomorrow". Not exactly tomorrow because code moves somewhat slowly in the Windows world, but tomorrow-ish.

We've been in touch with both projects, and they're to some extent aware of this. 😄
They are likely more aware now than they used to be, because Windows Terminal is high-profile. That VT support has still existed for years though...

My point about mintty was less that it was bad (it isn't!). It was more that mintty and Cygwin have a contract where Cygwin gives them special treatment (reading and writing data in its preferred format). Cygwin fully supports targeting the Windows console infrastructure, it just has to use the Windows Console APIs to do it. It is their integration with those APIs that is not working properly.

If the mintty’s developer faced the same issue, he clearly knew how to solve it.

Because mintty and Cygwin use that special communication channel, they probably never had that issue.

Our communication channel is the same one that Windows has had for the past two decades, not a "new" one that they need to learn to integrate with.

They recommend mintty because, for a very long time, the console simply didn't support the things they needed. It's only recently (again, ~5 years) that it has! Cygwin 3.1 recently came out, and they finally took a dependency on us for VT support. I can only hope that their subsequent revisions will improve the state of their integration with the console. 😁

(note: vim is largely the same -- they have a Win32 compatibility layer that is no longer necessary, but replacing it with native VT support is time-consuming work!)

Hope that helps!

@Mariusz-W
Author

Thank you for clarifying the situation.

I am hoping that WindowsTerminal receives the same “preferential treatment” from the Cygwin team as mintty does, since it has some advantages over mintty. As a result, I use both daily, often in parallel. A few of the “annoyances” of using Cygwin bash in WindowsTerminal:

  1. When copying and pasting the directory path from the Windows Explorer (file manager) address bar, mintty will convert the Windows path to the Cygwin path, which is very helpful. An example:

if the path is:

F:\Classes

mintty will paste it as:

/cygdrive/f/Classes

WindowsTerminal just pastes

F:\Classes

which, without surrounding the whole path in quotes or manually transforming it into a valid Cygwin path, is unusable.

  2. In mintty the middle mouse button allows me to scroll inside the buffer I am editing in Vim. In WindowsTerminal, the same Cygwin Vim in the same Cygwin Bash can’t do that. What does that indicate? Doesn’t WindowsTerminal xterm-compatibility extend enough to cover this feature?
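Editor's note on the path conversion in the first item: the mapping from F:\Classes to /cygdrive/f/Classes can be sketched as below. `to_cygdrive` is a hypothetical helper, not mintty's code, and it assumes the default `/cygdrive` prefix and a plain drive-letter path. Inside Cygwin itself, the `cygpath` utility performs this conversion.

```python
import re


def to_cygdrive(windows_path: str) -> str:
    """Convert a drive-letter Windows path to a /cygdrive-style POSIX path."""
    match = re.fullmatch(r"([A-Za-z]):[\\/]*(.*)", windows_path)
    if match is None:
        return windows_path  # not a drive-letter path; leave it unchanged
    drive, rest = match.groups()
    posix_rest = rest.replace("\\", "/")
    base = f"/cygdrive/{drive.lower()}"
    return f"{base}/{posix_rest}" if posix_rest else base


print(to_cygdrive(r"F:\Classes"))  # /cygdrive/f/Classes
```

The raw string matters in the usage line: an unquoted backslash is an escape character both in Python and in bash, which is why the pasted Windows path is unusable without quoting.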

I really hope WindowsTerminal’s “high visibility” does not diminish and grows even further, so that both the maintainer of the Windows Console Vim binary and the Cygwin team feel morally obliged to take the necessary steps to fully embrace WindowsTerminal.

I will take a look at the WindowsTerminal sources, because I am very curious how WindowsTerminal solves the font substitution problem. This is an issue of greatest interest to me, because the files I create contain, by default, characters from many different Unicode blocks. For nearly anything that doesn’t require extensive Unicode support, I am still using the regular Windows Console a lot.

@Mariusz-W
Author

The mintty developer provided me with some details about how mintty handles the clipboard: the whole conversion from/to UTF-16 is handled by mintty alone. According to him “(t)he Cygwin console handler code does not handle” characters that Windows encodes with surrogate pairs. This, then, seems to be the root cause of the pasting problem reported here. It is thus an urgent matter to file a bug report with Cygwin. The first Supplemental Plane is now being filled with very important stuff: all of the Mathematical alphabets and numerous symbols are there, and DejaVu Sans, Microsoft Cambria, and Stix (a joint project of major academic publishing houses and the American Mathematical and Physical Societies) fully support them. They are practically unusable in WindowsTerminal until the pasting is handled properly. The same goes for practically all of the Ancient scripts that are essential for philological work. I don’t mention emojis because for me they have no value, but I am sure a lot of people whose brain is totally immersed in social media could be asking for them.
