Changing code pages? #7854

Closed
vefatica opened this issue Oct 7, 2020 · 20 comments
Labels
Issue-Question For questions or discussion Needs-Tag-Fix Doesn't match tag requirements Resolution-Answered Related to questions that have been answered Resolution-By-Design It's supposed to be this way. Sometimes for compatibility reasons.

Comments

@vefatica

vefatica commented Oct 7, 2020

Environment

Microsoft Windows 10 Pro for Workstations
10.0.18363.1082 (1909)
WindowsTerminalPreview_1.4.2652.0_x64

Any other software?

Steps to reproduce

Expected behavior

As in a console.

Actual behavior

I don't know what to ask except for "What's happening here?". I have a 128-byte file containing the bytes 128~255. In both cases below, the font is Consolas.

Using CMD.EXE in a console I see this (which looks pretty good).

image

Using CMD.EXE in Windows Terminal I see this (which doesn't look as good).

image

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Oct 7, 2020
@vefatica
Author

vefatica commented Oct 7, 2020

Here's the file (renamed): 128-255.txt.

@DHowett
Member

DHowett commented Oct 7, 2020

Hey, a fun intersection between codepages and @skyline75489's work on C1 control codes!

Bad news: this is by design.
Double bad news: I'm not certain how to comport these things.

Notes

This file contains the bytes 128-255. When translated in codepage 1252, these values become codepoints:

0x80: 20AC
0x81: ----
0x82: 201A
...
0x8E: 017D 
0x8F: ----
0x90: ----

The translations for 81, 8F, 90, and a few others are unspecified as part of the codepage. This means that a receiving application is free to do pretty much anything.

Wikipedia notes that "MultiByteToWideChar" maps them to the corresponding C1 control codes.

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.
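
The unassigned positions can be checked with a short Python sketch. It relies on Python's own `cp1252` codec, which rejects unassigned bytes rather than mapping them the way `MultiByteToWideChar` is described as doing above. The helper `decode_best_fit` is a hypothetical name that only imitates the quoted "best fit" behavior; it does not call the Windows API.

```python
def undefined_cp1252_bytes():
    """Return the byte values in 0x80..0xFF that Python's cp1252 codec rejects."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def decode_best_fit(data):
    """Hypothetical helper: decode cp1252, mapping unassigned bytes to the
    C1 control code with the same value (an imitation of the quoted behavior)."""
    out = []
    for value in data:
        try:
            out.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(value))  # e.g. 0x9D becomes U+009D (OSC)
    return "".join(out)


print([hex(v) for v in undefined_cp1252_bytes()])
print([hex(ord(c)) for c in decode_best_fit(b"\x80\x81\x9d")])
```

Under this imitation, byte 0x9D in the test file becomes U+009D, which a terminal that honors C1 controls would read as OSC rather than print as a glyph.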

As of #7340, the Windows Console will properly treat 80..9F as control characters. The codepage says we're supposed to treat them like control characters.

Applications that want to print literal invalid characters (characters unspecified in the output codepage are not valid characters!) to the screen should not be using VT processing mode. CMD is not such an application: type is intended to print valid ANSI data in the system's codepage to the screen. If given a file whose contents are unrepresentable in the ANSI codepage, it will behave erratically.

@DHowett DHowett added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Oct 7, 2020
@eryksun

eryksun commented Oct 8, 2020

type is intended to print valid ANSI data in the system's codepage to the screen.

CMD's internal type command decodes a file using the console's output codepage (i.e. GetConsoleOutputCP()), not the process ANSI codepage. It also detects UTF-16 if the file starts with a BOM. The decoded text is written to the console via wide-character WriteConsoleW.
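
A rough sketch of that decoding rule, assuming a little-endian UTF-16 BOM and a codepage name that the caller supplies (on Windows it could be derived from `GetConsoleOutputCP()`). The function name `decode_like_type` is hypothetical, and this is not CMD's actual code:

```python
def decode_like_type(data, console_codepage):
    """Decode file bytes roughly the way the comment above describes.

    console_codepage is a Python codec name such as "cp437" or "cp1252";
    it stands in for the console's output codepage.
    """
    if data.startswith(b"\xff\xfe"):
        # A UTF-16 (little-endian) BOM takes priority over the codepage.
        return data[2:].decode("utf-16-le", errors="replace")
    # Otherwise every byte is interpreted in the console's output codepage.
    return data.decode(console_codepage, errors="replace")


print(decode_like_type(b"\xff\xfeh\x00i\x00", "cp437"))
print(decode_like_type(b"\x80", "cp437"), decode_like_type(b"\x80", "cp1252"))
```

The same byte decodes to different characters depending on the active codepage, which is why the output of `type` changes after `chcp`.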

Here are some examples with the C1 control characters, using Python under Windows Terminal.

The most powerful of the C1 control characters is CSI (0x9b), which also works in a regular console with virtual-terminal (VT) mode enabled. Fortunately codepage 1252 maps 0x9b, so it's not an issue here.

>>> # CSI (0x9b): Control Sequence Introducer
>>> print('spam\x9b4Deggs') # CSI 4D
eggs

The following examples are currently only implemented by Windows Terminal. When writing to a system conhost.exe console (at least as of release 2004), even with VT mode enabled, the C1 codes simply appear as default glyphs (e.g. an empty rectangle). Of interest here regarding codepage 1252 are 0x8D (RI), 0x8F (SS3), 0x90 (DCS), and 0x9D (OSC). CP1252 also doesn't map 0x81, but the HOP (High Octet Preset) control code is ignored.

>>> # IND (0x84): Index (Line Feed)
>>> print('spam\x84eggs')
spam
    eggs

>>> # NEL (0x85): Next Line
>>> print('spam\x85eggs') 
spam
eggs

>>> # RI (0x8d): Reverse Index (Line Feed)
>>> print('\nspam\x8deggs\n')
    eggs
spam

>>> # SS2 (0x8e): Single-Shift G2
>>> # SS3 (0x8f): Single-Shift G3
>>> print('\x8e0\x8e1\x8e2\x8e3')
°±²³
>>> print('\x8f0\x8f1\x8f2\x8f3')
°±²³

>>> # ST (0x9C): String Terminator
>>> # DCS (0x90): Device Control String
>>> # SOS (0x98): Start of String
>>> # OSC (0x9d): Operating System Command
>>> # PM (0x9e): Privacy Message
>>> # APC (0x9f): Application Program Command
>>> print('\x90spam\x9ceggs')
eggs
>>> print('\x98spam\x9ceggs')
eggs
>>> print('\x9dspam\x9ceggs')
eggs
>>> print('\x9espam\x9ceggs')
eggs
>>> print('\x9fspam\x9ceggs')
eggs

@DHowett
Member

DHowett commented Oct 8, 2020

Good catch on the specifics of the implementation of TYPE. Horrifyingly, it's written to call WriteConsole and hope that UNICODE is set. Those sure were the days.

The console host that comes out with the version of Windows after 2004 will contain the changes from #7317.

@eryksun

eryksun commented Oct 8, 2020

Unfortunately most Windows filesystems only reserve the C0 block in filenames, not the C1 block. I don't want displaying a filename to evaluate CSI sequences or IND, RI, and NEL line feeds. POSIX systems permissively allow control characters in filenames, but POSIX CLI programs such as ls address this by escaping the C0 and C1 control characters when displaying files. This is not the case for Windows PowerShell and CMD:

Python

>>> import os; os.listdir('.')
['spam\x9b4Deggs.txt']

WSL

/mnt/c/Temp/test$ ls
'spam'$'\302\233''4Deggs.txt'

PowerShell

PS C:\temp\test> gci -n
eggs.txt

CMD

C:\Temp\test>dir /b
eggs.txt
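
One way a listing tool could avoid this is to escape C0 and C1 controls before printing a name. The sketch below is only an illustration of that idea; `escape_controls` is a hypothetical helper, not something PowerShell, CMD, or `ls` provides, and the `\xNN` output format is an arbitrary choice.

```python
def escape_controls(name):
    """Replace C0 (0x00-0x1F), DEL (0x7F) and C1 (0x80-0x9F) characters
    with a visible \\xNN escape so a terminal never interprets them."""
    parts = []
    for ch in name:
        code = ord(ch)
        if code < 0x20 or 0x7F <= code <= 0x9F:
            parts.append("\\x{:02x}".format(code))
        else:
            parts.append(ch)
    return "".join(parts)


print(escape_controls("spam\x9b4Deggs.txt"))
```

With the escape applied, the CSI byte in the filename is shown as text instead of moving the cursor back over "spam".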

@vefatica
Author

vefatica commented Oct 8, 2020

Thanks, gentlemen. I didn't know most of that. Here's a question (I'll probably have more).

I imagine it was the OSC (0x9D) that was cutting off the tail in my example (CP 1252 in Windows Terminal). Wiki says OSC should be followed by a string of printables (0x20~0x7E). That was not the case in my example. What's up with that? And was it waiting for ST (0x9C)? CP 1252 uses 0x9C.

@ghost ghost added Needs-Attention The core contributors need to come back around and look at this ASAP. and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Oct 8, 2020
@j4james
Collaborator

j4james commented Oct 8, 2020

Any escape or C1 control should terminate the sequence, as would the SUB and CAN control characters. For example, I'm assuming your prompt has an escape sequence that sets the color to green - that was most likely the terminator in your case. Without that you probably would have found the terminal stuck in a weird state where you couldn't see any output because it would think it was still processing an OSC or APC sequence (whichever came last).

@vefatica
Author

vefatica commented Oct 8, 2020

j4james, as I read your comments I was trying to get it stuck as you said (not realizing that it was my prompt preventing it).

What about wiki's comment that the Operating System Command be composed of 0x20~0x7E? Is it accurate? Characters > 0x7F don't, in general, terminate the OSC string. Neither do characters < 0x20 except for ESC. ST (0x9C) does even though it's used by the CP. And seeing that OSC is honored, what happens to the string itself ... ignored?

@vefatica
Author

vefatica commented Oct 8, 2020

I spoke a bit prematurely. In fact, 0x7 (BEL), 0x18 (CAN), and 0x1A (SUB) also terminate the OSC string.

@vefatica
Author

vefatica commented Oct 8, 2020

And I think I was wrong about ST (0x9C) terminating OSC.

@j4james
Collaborator

j4james commented Oct 8, 2020

BEL is a bit of a weird case. Technically it's not a standard string terminator, but at some point in the past it was used as an OSC terminator by a popular terminal emulator, and that ended up becoming a de facto standard.

As for characters > 0x7F, I believe anything in the range 0xA0 to 0xFE is technically supposed to be interpreted as 0x20 to 0x7E when included in a control sequence. I don't think we follow those rules exactly, but since we're typically dealing with Unicode the original specs don't really apply in that sense. Either way, though, I wouldn't expect a character > 0xA0 to terminate a string sequence. C1 controls should, although there may be exceptions - I'm not positive about that.

And yes ST should terminate an OSC sequence.
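
The termination rules mentioned in this thread (BEL, CAN, SUB, ESC, and any C1 control including ST) can be summarized in a small sketch. It is a reading of the comments here, not the Windows Terminal parser; `read_osc_payload` is a hypothetical name, and the input is assumed to start right after the OSC introducer.

```python
# Single characters named in the thread as ending an OSC string.
TERMINATORS = {"\x07", "\x18", "\x1a", "\x1b"}  # BEL, CAN, SUB, ESC


def is_osc_terminator(ch):
    """True for BEL, CAN, SUB, ESC, or any C1 control (0x80-0x9F, incl. ST)."""
    return ch in TERMINATORS or "\x80" <= ch <= "\x9f"


def read_osc_payload(text):
    """Return (payload, terminator); terminator is None if the string never ends."""
    for index, ch in enumerate(text):
        if is_osc_terminator(ch):
            return text[:index], ch
    return text, None


print(read_osc_payload("2;title\x07rest"))
print(read_osc_payload("spam\x9ceggs"))
```

A `None` terminator corresponds to the "stuck" state described earlier, where the terminal keeps swallowing output as part of the string.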

@DHowett
Member

DHowett commented Oct 9, 2020

(Closing as by design + question, but do feel free to continue the discussion!)

@DHowett DHowett closed this as completed Oct 9, 2020
@DHowett DHowett added Issue-Question For questions or discussion Resolution-Answered Related to questions that have been answered Resolution-By-Design It's supposed to be this way. Sometimes for compatibility reasons. and removed Needs-Attention The core contributors need to come back around and look at this ASAP. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Oct 9, 2020
@vefatica
Author

vefatica commented Oct 9, 2020

Do any OSC sequences work? The only one I tried was OSC2;titleBEL. That didn't work. Neither did the equivalent (?) ESC]2;titleBEL (which does change the title in a conhost console).

@zadjii-msft
Member

Uh yea, a whole bunch of them should

enum OscActionCodes : unsigned int
{
SetIconAndWindowTitle = 0,
SetWindowIcon = 1,
SetWindowTitle = 2,
SetWindowProperty = 3, // Not implemented
SetColor = 4,
Hyperlink = 8,
SetForegroundColor = 10,
SetBackgroundColor = 11,
SetCursorColor = 12,
SetClipboard = 52,
ResetForegroundColor = 110, // Not implemented
ResetBackgroundColor = 111, // Not implemented
ResetCursorColor = 112
};

What string exactly are you trying to emit? And do you have "suppressApplicationTitle": true set?

@vefatica
Author

vefatica commented Oct 9, 2020

I have: "suppressApplicationTitle": true

I've tried both of these:

L"\x009d2;new_title\x0007"
L"\x001b]2;new_title\x0007"

I'm using WriteConsoleW and a HANDLE to L"CONOUT$". The second one above works in conhost.

I can also send them from a TCC command line.

echos %@char[0x9d]2;new_title%@char[7]
echos ^e]2;new_title%@char[7]

The results are the same; neither works in WT and the second works in a conhost console.
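
For comparison, the two forms of the title sequence can be built in Python, where `\x` always takes exactly two hex digits. This is a minimal sketch; `osc_title` is a hypothetical helper, and whether a terminal acts on the result depends on settings such as `suppressApplicationTitle` and on its C1 support.

```python
ESC = "\x1b"     # 7-bit escape; ESC ] introduces an OSC string
BEL = "\x07"     # de facto OSC terminator
OSC_C1 = "\x9d"  # 8-bit (C1) form of the OSC introducer


def osc_title(title, use_c1=False):
    """Build an OSC 2 (set window title) sequence terminated by BEL."""
    introducer = OSC_C1 if use_c1 else ESC + "]"
    return introducer + "2;" + title + BEL


print(repr(osc_title("new_title")))
print(repr(osc_title("new_title", use_c1=True)))
```

Writing either string to a terminal with `sys.stdout.write` would be the Python counterpart of the `WriteConsoleW` calls above.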

@DHowett
Member

DHowett commented Oct 9, 2020

"suppressApplicationTitle": true

This disables OSC2.

@vefatica
Author

vefatica commented Oct 9, 2020

OK. I was thinking that affected only SetConsoleTitle(). So without suppressApplicationTitle = true,

L"\x001b]2;new_title\x0007" works

and

L"\x009d2;new_title\x0007" doesn't work.

@zadjii-msft
Member

I don't believe the C1 codes work quite yet, see #7340

@vefatica
Author

vefatica commented Oct 9, 2020

Actually, it's either a compiler bug (VS Community 2019) or my misunderstanding. This page seems to make it clear that L"\xhhhh" denotes a wide char. Any more than 4 hex digits doesn't make sense! Yet these two strings are different:

WCHAR sz1[32];
wsprintf(sz1, L"%c2;new_title1\x0007", 0x9d);
WCHAR sz2[32] = L"\x009d2;new_title1\x0007";

The first one above DOES work in Windows Terminal. The first character of the second one is 0x9d2.

@vefatica
Author

Yup, my misunderstanding (or more like ignorance). This works

L"\x009d" L"2;new_title1\x0007"

as does this

L"\u009d2;new_title1\x0007"

but I'm not sure why the second one works.
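
The likely explanation is that a C or C++ `\x` escape consumes every hex digit that follows it, while a `\u` escape takes exactly four, so `\u009d` stops before the `2`. Splitting the literal, as in the first workaround, also ends the `\x` escape. The Python sketch below only emulates those two rules on the text of a literal; `first_char_of_c_wide_literal` is a hypothetical helper and not a full C lexer.

```python
import re


def first_char_of_c_wide_literal(body):
    """Return the code point of the first character a C wide string literal
    body would produce, handling only leading \\x and \\u escapes."""
    greedy = re.match(r"\\x([0-9A-Fa-f]+)", body)   # \x: as many hex digits as follow
    if greedy:
        return int(greedy.group(1), 16)
    fixed = re.match(r"\\u([0-9A-Fa-f]{4})", body)  # \u: exactly four hex digits
    if fixed:
        return int(fixed.group(1), 16)
    return ord(body[0])


print(hex(first_char_of_c_wide_literal(r"\x009d2;new_title1")))
print(hex(first_char_of_c_wide_literal(r"\u009d2;new_title1")))
```

The first call reproduces the 0x9d2 seen above, and the second yields 0x9d, the intended OSC introducer.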


5 participants