Pasting Unicode characters in Cygwin Bash loses the first element of the surrogate pair #8302
Comments
Post scriptum: the GitHub engine replaced the 2-byte surrogate sequence in the text of my bug report with the � character (the Unicode REPLACEMENT CHARACTER, code point FFFD). |
Which version of the Cygwin runtime are you using? |
Cygwin 3.1.7(0.340/5/3) |
So, this is really baffling to me. It works in PowerShell Core, ConEchoKey, and WSL. When I look at ConEchoKey's output, I get the right data:
It's coming in as the two halves of the surrogate pair, even when ConEchoKey reads one unit at a time. That's totally what I'd expect. I'm going to chalk this one up to application error. Cygwin has, perhaps, never had to support somebody inputting high unicode? |
This is not an error in Cygwin’s Bash. In Cygwin’s default terminal application, mintty, everything works perfectly. I constantly use and input Unicode characters from the higher planes, and there are no problems with copying and pasting them. Cygwin Bash has full support for all the Unicode planes. I had to go back to mintty precisely because of the pasting issue I reported in WindowsTerminal.

What I like about WindowsTerminal is that it can substitute Unicode characters missing from the default font with glyphs from proportional fonts, while other terminal programs are limited to strictly monospaced fonts. I would love to learn more about the mechanism WindowsTerminal employs to achieve that. It is really very useful.

Returning to pasting: there are also problems with pasting Unicode characters from the higher planes into the WindowsTerminal CMD shell, where the pasted character is displayed wrongly as a surrogate pair. Pasting Unicode characters from the Basic Plane works in the WindowsTerminal CMD shell as expected.

In WindowsTerminal PowerShell, pasting is broken even more, including for Unicode characters from the Basic Plane, as the following example shows:

PS C:\WINDOWS\System32> [char]0xf900

Now, copying 豈 and pasting it back at the command prompt, WindowsTerminal displays it as a question mark:

PS C:\WINDOWS\System32> ?

If I hit «Enter» to check how PowerShell interprets what I pasted, I see that PowerShell sees it as ‘豈’, but if I copy the character under ‘?’ I get simply the question mark. |
“It works in PowerShell Core, ConEchoKey, and WSL. When I look at ConEchoKey's output, I get the right data: Down: 0 Repeat: 1 KeyCode: 0x12 ScanCode: 0x38 Char: ? (0xd83c) KeyState: 0x0”

Pasting in WindowsTerminal PowerShell has its own problems, which are even more critical, as I described in my previous message. PowerShell displays Unicode Basic Plane characters and surrogates as question marks, and this is not just the visual representation: if you copy what you pasted, you copy two question mark characters. |
Alright, so I got this under a debugger. The client (while I am attached to the console host) is using cygwin 3.1's runtime. Here's what the events that we return to Cygwin look like:
As far as we can tell, the events are being generated into Cygwin's address space properly, with both halves of the surrogate pair. Unfortunately, mintty is not properly representative of Cygwin's console interoperability. Cygwin has a backchannel that it uses directly with mintty, and that channel doesn't go through the console APIs. CMD is tracked in #7777, and PowerShell is a wildcard to me. It works here, but I found a report on PSReadLine (which handles PowerShell's input) that some folks have trouble with it: PowerShell/PSReadLine#1329 With the debugging done, and the input buffer getting all the way into Cygwin . . . I'm ready to call this an application issue. 😄 |
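To illustrate what the receiving application has to do with such events, here is a minimal Python sketch of joining UTF-16 code units that arrive one at a time. It is not Cygwin's or the console host's actual code: the function name `decode_code_units` is hypothetical, and substituting U+FFFD for an unpaired surrogate is an assumed policy.

```python
def decode_code_units(units):
    """Join UTF-16 code units that arrive one at a time into a string.

    A high surrogate is held back until the next unit arrives.
    Unpaired surrogates become U+FFFD (an assumed policy, for illustration).
    """
    out = []
    pending_high = None
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Both halves are present: rebuild the supplementary code point.
                cp = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                out.append(chr(cp))
                pending_high = None
                continue
            out.append("\ufffd")  # high surrogate with no partner
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit   # wait for the low half
        elif 0xDC00 <= unit <= 0xDFFF:
            out.append("\ufffd")  # low surrogate with no preceding high half
        else:
            out.append(chr(unit))
    if pending_high is not None:
        out.append("\ufffd")      # input ended in the middle of a pair
    return "".join(out)
```

With both halves delivered, `decode_code_units([0xD83C, 0xDC04])` yields 🀄. If a reader handles only the low half, the same logic produces U+FFFD.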
First, thank you for continuing to investigate the pasting issues. I am not sure I understand your remark about mintty. It has been embraced by both the maintainers and the users as the default terminal for Cygwin. It is Cygwin’s “mainstream”, so to speak, not some hack by an external enthusiast. I don’t know anybody who would use the old Cygwin console for any work requiring Unicode.

The pasting issue is clearly in the area of how the terminal application interacts with Cygwin or Cygwin Bash. If mintty’s developer faced the same issue, he clearly knew how to solve it. Assuming you are in charge of maintaining WindowsTerminal, it would be a great idea to contact the Cygwin maintainers to find out how to fix the communication between WindowsTerminal and Cygwin. WindowsTerminal will be quickly adopted as a great alternative to mintty by a lot of people who have been using Cygwin for many years, provided all the basic issues are resolved. At the moment WindowsTerminal has some advantages over mintty, like the ability to substitute missing characters from fonts that are not monospaced (where can I learn more about how that is achieved?).

I will also be grateful for fixes to pasting into PowerShell and CMD. At the moment, editing any text file that contains Unicode characters inside PowerShell is simply impossible, and I have such characters in nearly every file I write or edit. With CMD the problem is less severe: the surrogate pairs are not properly displayed, but they are nevertheless there and can be properly copied.

A separate but closely related issue: make the Vim people aware that Vim on Windows is now Vim in WindowsTerminal. The current Vim Windows binaries are completely broken in WindowsTerminal if I paste any Unicode character, and I don’t know where the problem lies. |
@Mariusz-W I've had the same issue with Vim, even if I used the surrogate pair outside the vim session. It completely borks my terminal to open any files to look at them :( All I want is a snake for my virtualenv prompt lol |
So, to address a couple of your points... it's complicated. I've got good news, though. Cygwin on Windows Terminal and Vim on Windows Terminal are Cygwin and Vim on the traditional Windows console. We have had support for xterm-compatible control sequences for five years now. It's been backwards compatible for that whole time, too. Five years is a long time for the application and framework authors to support these things! Windows Terminal builds on that support, but at its heart it is still the Windows console code. That code (in this repository, it's in src/host, src/server, etc.) actually goes right back into Windows! Everything Terminal supports today, the traditional Windows console will support "tomorrow". Not exactly tomorrow because code moves somewhat slowly in the Windows world, but tomorrow-ish. We've been in touch with both projects, and they're to some extent aware of this. 😄 My point about mintty was less that it was bad (it isn't!). It was more that mintty and Cygwin have a contract where Cygwin gives mintty special treatment (reading and writing data in its preferred format). Cygwin fully supports targeting the Windows console infrastructure, it just has to use the Windows Console APIs to do it. It is their integration with those APIs that is not working properly.
Because mintty and Cygwin use that special communication channel, they probably never had that issue. Our communication channel is the same one that Windows has had for the past two decades, not a "new" one that they need to learn to integrate with. They recommend mintty because, for a very long time, the console simply didn't support the things they needed. It's only recently (again, ~5 years) that it has! Cygwin 3.1 recently came out, and they finally took a dependency on us for VT support. I can only hope that their subsequent revisions will improve the state of their integration with the console. 😁 (note: vim is largely the same -- they have a Win32 compatibility layer that is no longer necessary, but replacing it with native VT support is time-consuming work!) Hope that helps! |
Thank you for clarifying the situation. I am hoping that WindowsTerminal receives the same “preferential treatment” from the Cygwin team as mintty does, since it has some advantages over mintty. As a result, I use both daily, often in parallel. A few of the “annoyances” of using Cygwin bash in WindowsTerminal:
If the path is F:\Classes, mintty will paste it as /cygdrive/f/Classes, while WindowsTerminal just pastes F:\Classes, which is unusable unless the whole path is surrounded in quotes or manually transformed into a valid Cygwin path.
I really hope WindowsTerminal’s “high visibility” does not diminish and grows even further, so that both the maintainer of the Windows Console Vim binary and the Cygwin team feel morally obliged to take the necessary steps to fully embrace WindowsTerminal. I will take a look at the WindowsTerminal sources, because I am very curious how WindowsTerminal solves the font substitution problem. This issue is of the greatest interest to me, because the files I create contain, by default, characters from many different Unicode blocks. For nearly anything that doesn’t require extensive Unicode support, I am still using the regular Windows Console a lot. |
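The path translation in the F:\Classes example above can be sketched in a few lines of Python. This is only an illustration, not what mintty actually does internally: `to_cygdrive` is a hypothetical helper, and it assumes the default `/cygdrive` prefix. Inside Cygwin itself, the `cygpath -u` utility performs this conversion.

```python
import re


def to_cygdrive(windows_path):
    """Convert a drive-letter Windows path to a /cygdrive-style path.

    Assumes the default /cygdrive prefix; paths without a drive letter
    are returned unchanged.
    """
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", windows_path)
    if not match:
        return windows_path
    drive, rest = match.groups()
    rest = rest.replace("\\", "/")  # backslashes become forward slashes
    prefix = "/cygdrive/" + drive.lower()
    return prefix + "/" + rest if rest else prefix


print(to_cygdrive(r"F:\Classes"))
```

A real terminal would also need to quote or escape spaces and shell metacharacters before pasting the result at a prompt.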
The mintty developer provided me with some details about how mintty handles the clipboard: the whole conversion from/to UTF-16 is handled by mintty alone. According to him, “(t)he Cygwin console handler code does not handle” characters that Windows encodes with surrogate pairs. This seems, then, to be the root cause of the pasting problem reported here. It is thus an urgent matter to file a bug report with Cygwin. The first Supplemental Plane is now being filled with very important stuff: all of the Mathematical alphabets and numerous symbols are there, and DejaVu Sans, Microsoft Cambria, and STIX (a joint project of major academic publishing houses and the American Mathematical and Physical Societies) fully support them. They are practically unusable in WindowsTerminal until pasting is handled properly. The same goes for practically all of the Ancient scripts that are essential for philological work. I don’t mention emojis because for me they have no value, but I am sure a lot of people whose brain is totally immersed in social media could be asking for them. |
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
LC_CTYPE=en_US.UTF-8
Windows Terminal
Version: 1.4.3141.0
Attempting to paste a character from the Unicode Supplemental Planes into the Cygwin bash shell, with the right mouse button or with Ctrl-Shift-V, has the following effect.
Instead of pasting the character, it pastes the 2nd element of the surrogate pair representing that character in the UTF-16BE encoding.
Example: attempting to paste 🀄 it pastes instead �
Character 🀄 has the Unicode code point: 1F004
UTF-16BE encodes 🀄 as the surrogate pair: D83C DC04
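The surrogate pair quoted above follows from the standard UTF-16 arithmetic, which can be checked with a short Python sketch (the function name `to_surrogate_pair` is just an illustrative choice):

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F004)
print(f"{high:04X} {low:04X}")  # prints "D83C DC04"
```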
I see the 2nd surrogate element directly, when I paste the character within Cygwin ‘vim’.
When I paste the character into the Cygwin bash command line, I see the 2nd surrogate element re-encoded as the corresponding three-byte UTF-8 sequence. In the above example that happens to be the sequence: EF BF BD.
This suggests that the problem arises because WindowsTerminal does not pass the 1st element of the surrogate pair to Cygwin Bash along with the 2nd element. If both elements were passed, the problem would likely go away.
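As a cross-check on the byte values mentioned in this report, the following Python sketch prints the UTF-8 bytes for three cases: U+FFFD, the lone low surrogate DC04 (encoded with Python's `surrogatepass` error handler, since strict UTF-8 does not permit surrogates), and the full character U+1F004. It only shows the encodings and makes no claim about which component in the paste path produced them.

```python
# UTF-8 bytes of the REPLACEMENT CHARACTER U+FFFD
replacement = "\ufffd".encode("utf-8")

# Bytes a lone low surrogate would have if encoded like an ordinary code point
lone_low = "\udc04".encode("utf-8", errors="surrogatepass")

# UTF-8 bytes of the intended character U+1F004
full = "\U0001F004".encode("utf-8")

print(replacement.hex(" "))  # ef bf bd
print(lone_low.hex(" "))     # ed b0 84
print(full.hex(" "))         # f0 9f 80 84
```

Note that EF BF BD is the encoding of U+FFFD, so seeing those bytes would be consistent with some layer having replaced an unpaired surrogate with the replacement character.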