Proposal: Make Console.Input/OutputEncoding default to UTF-16 on Windows #70168

Closed
huoyaoyuan opened this issue Jun 2, 2022 · 12 comments

Comments

@huoyaoyuan
Member

Background

Currently, System.Console calls GetConsoleCP on Windows to get the console encoding, which has caused enormous problems:

  • Under the default settings, the console itself can display and accept characters that are not in the current code page:
    image
    But unless Encoding.Unicode is specified explicitly, a .NET application can't display emoji (via Windows Terminal) or any script the code page doesn't represent. (On a Windows-1252 system it would not be able to display Chinese.)

  • Characters are frequently transcoded the wrong way and get garbled.
    See "C# Interactive is broken in VS16.8 preview5" (roslyn#48874). Like the person in that thread, I'm pretty annoyed too.
    Output is also garbled with the latest dotnet SDK. This issue newly appeared with an SDK update this month (May).
    image

Proposal

ANSI code pages are totally legacy. We should get rid of them entirely and use some variant of Unicode everywhere.
The internal encoding of Windows NT is UTF-16, the same as .NET. We would also save the time spent transcoding from UTF-16 to a code page and then back to UTF-16.

This would be a breaking change for those who work with Console.OpenStandardXXX and redirected IO, which can be addressed by setting the console encoding at the program entry point. We may also add a compatibility switch for this. For ASCII interoperability, we should suggest setting the encoding to UTF-8.
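As a rough illustration of "setting the console encoding at the program entry point", a program could opt into UTF-8 before doing any console IO. This is only a sketch: the class and method names are made up for the example, and the thread does not confirm that this covers every redirected-IO case.

```csharp
using System;
using System.IO;
using System.Text;

static class EntryPointEncoding
{
    // Hypothetical helper: asks for UTF-8 console input and output.
    // Returns false if the platform or host refuses the change
    // (for example when no console is attached).
    public static bool TryUseUtf8Console()
    {
        try
        {
            Console.OutputEncoding = Encoding.UTF8;
            Console.InputEncoding = Encoding.UTF8;
            return true;
        }
        catch (IOException)
        {
            return false;
        }
        catch (PlatformNotSupportedException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Do this first, before anything reads from or writes to the console.
        bool ok = TryUseUtf8Console();
        Console.WriteLine($"UTF-8 requested: {ok}, output code page: {Console.OutputEncoding.CodePage}");
    }
}
```

The same pattern would apply if a program wanted to opt back into a legacy code page after a default change; only the encoding passed to the two setters differs.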

Additionally, setting the default encoding to UTF-16 would expose encoding problems even when only English is used. Since most code pages, including UTF-8, share the ASCII range, English text always comes out correctly even under a misconfigured encoding. Since most development is done in English, encoding problems stay silent.
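The point about the shared ASCII range can be checked directly by comparing encoded bytes. In this sketch, Encoding.Latin1 (available in .NET 5 and later) is used as a stand-in for a legacy single-byte code page; that substitution, and the helper name, are assumptions of the example.

```csharp
using System;
using System.Linq;
using System.Text;

static class AsciiOverlapDemo
{
    // True when both encodings produce exactly the same bytes for the text.
    public static bool SameBytes(string text, Encoding a, Encoding b) =>
        a.GetBytes(text).SequenceEqual(b.GetBytes(text));

    static void Main()
    {
        const string english = "Hello";

        // ASCII-only text is byte-identical in ASCII, Latin-1 and UTF-8,
        // so a wrong guess among these encodings goes unnoticed.
        Console.WriteLine(SameBytes(english, Encoding.ASCII, Encoding.UTF8));   // True
        Console.WriteLine(SameBytes(english, Encoding.Latin1, Encoding.UTF8));  // True

        // UTF-16 uses two bytes per character even for ASCII,
        // so a mismatch shows up immediately, even for English text.
        Console.WriteLine(SameBytes(english, Encoding.Unicode, Encoding.UTF8)); // False

        // One non-ASCII character is enough to make Latin-1 and UTF-8 diverge.
        Console.WriteLine(SameBytes("\u00E9", Encoding.Latin1, Encoding.UTF8)); // False
    }
}
```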

Additional words

I'd really like you to treat encoding problems as severe bugs. They are never a problem for English users, but they have frustrated users of other languages for decades, since the start of multi-language Windows. Fixing such a problem in a minor release of VS instead of a patch release is unacceptable to me, as well as to other Chinese users.
Systems with multi-byte encodings suffer more, even from non-coding elements. Characters in the wrong encoding appear as broken multi-byte sequences (#69781).
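A small sketch of how a broken multi-byte sequence arises when the writer and the reader disagree about the encoding. Encoding.Latin1 again stands in for a legacy single-byte code page, and the Transcode helper is hypothetical; it is not taken from #69781.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes with the writer's encoding, then decodes with the reader's,
    // which is what happens when the two sides disagree.
    public static string Transcode(string text, Encoding writer, Encoding reader) =>
        reader.GetString(writer.GetBytes(text));

    static void Main()
    {
        // Two Chinese characters become six unrelated Latin-1 characters,
        // one per UTF-8 byte.
        string garbled = Transcode("\u4E2D\u6587", Encoding.UTF8, Encoding.Latin1);
        Console.WriteLine(garbled.Length); // 6

        // The reverse direction: the single Latin-1 byte 0xE9 is an incomplete
        // UTF-8 sequence, so the decoder substitutes U+FFFD.
        string replaced = Transcode("\u00E9", Encoding.Latin1, Encoding.UTF8);
        Console.WriteLine(replaced == "\uFFFD"); // True
    }
}
```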
There is a Spanish build in the roslyn CI. Can we add a CI leg to verify that the runtime builds (and tests run?) correctly on a non-English system?

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 2, 2022
@ghost

ghost commented Jun 2, 2022

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: huoyaoyuan
Assignees: -
Labels:

area-System.Text.Encoding, untriaged

Milestone: -

@tarekgh
Member

tarekgh commented Jun 2, 2022

On Windows you can set the default code page to UTF-8, and this will be reflected in all .NET applications. To do that, run intl.cpl, click the Administrative tab, click the Change system locale... button, then check the box labeled Beta: Use Unicode UTF-8 for worldwide language support.

image

On non-Windows platforms, most terminals already use UTF-8.

@huoyaoyuan
Member Author

> On Windows you can set the default codepage to UTF-8 and this will reflect on all .NET applications.

I know this option. Unfortunately, there are still tons of encoding issues with it enabled, both existing and newly introduced. This option doesn't solve any issue at all.
Changing a system-wide setting is not an option either. It would affect more than just .NET applications, and many applications won't handle it well.

Using UTF-16 has the further benefit that the console is driven through the W variants of the console API, instead of the file API.

    else
    {
        // If the code page could be Unicode, we should use ReadConsole instead, e.g.
        // Note that WriteConsoleW has a max limit on num of chars to write (64K)
        // [https://docs.microsoft.com/en-us/windows/console/writeconsole]
        // However, we do not need to worry about that because the StreamWriter in Console has
        // a much shorter buffer size anyway.
        int charsWritten;
        writeSuccess = Interop.Kernel32.WriteConsole(hFile, p, bytes.Length / BytesPerWChar, out charsWritten, IntPtr.Zero);
        Debug.Assert(!writeSuccess || bytes.Length / BytesPerWChar == charsWritten);
    }

@davidfowl
Member

Is this a mega breaking change?

@ufcpp
Contributor

ufcpp commented Jun 3, 2022

GetConsoleCP is OK. What I want is just running F5 Debug Console with CP 65001.

@huoyaoyuan
Member Author

> Is this a mega breaking change?

In fact, I don't know. It also depends on how Windows handles the relationship between the console file and the console APIs.

In other words, I want to make WriteConsoleW the default, instead of the current WriteFile.
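To make the WriteConsoleW-versus-WriteFile distinction concrete, here is a minimal P/Invoke sketch, not the runtime's actual implementation. It assumes .NET 5 or later. The class name and the fallback policy are invented for the example. It writes UTF-16 text straight to the console when standard output is a real console and otherwise falls back to the normal encoded stream.

```csharp
using System;
using System.Runtime.InteropServices;

static class ConsoleWriteSketch
{
    private const int STD_OUTPUT_HANDLE = -11;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr GetStdHandle(int nStdHandle);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GetConsoleMode(IntPtr hConsoleHandle, out uint lpMode);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool WriteConsoleW(IntPtr hConsoleOutput, string lpBuffer,
        uint nNumberOfCharsToWrite, out uint lpNumberOfCharsWritten, IntPtr lpReserved);

    public static void Write(string text)
    {
        if (OperatingSystem.IsWindows())
        {
            IntPtr handle = GetStdHandle(STD_OUTPUT_HANDLE);

            // GetConsoleMode only succeeds for a real console handle, so it
            // doubles as a "not redirected" check. WriteConsoleW then takes
            // UTF-16 directly, with no code page involved.
            if (GetConsoleMode(handle, out _) &&
                WriteConsoleW(handle, text, (uint)text.Length, out _, IntPtr.Zero))
            {
                return;
            }
        }

        // Redirected output or a non-Windows platform: go through the
        // byte stream, where Console.OutputEncoding applies.
        Console.Out.Write(text);
    }

    static void Main()
    {
        Write("caf\u00E9 \u4E2D\u6587" + Environment.NewLine);
    }
}
```

The sketch ignores partial writes and the per-call size limit mentioned in the quoted runtime comment, so it only suits short strings.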

@huoyaoyuan
Member Author

I did some tests with redirection:

  • The > operator of cmd (native redirect) writes the output to the file as-is, in whatever encoding the application specified.
  • The > operator of PowerShell always reads the output as the current system encoding and writes it as UTF-8. It does not react to the application changing its console encoding. Anything not in the system encoding gets garbled.

No magic happens. Both sides of the pipe need to agree on the encoding. Changing the default to UTF-16 would break a lot, since UTF-16 isn't widely used as a file or communication encoding.
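The requirement that both sides agree can be simulated in-process, with a MemoryStream standing in for the pipe. This is only an analogy for what cmd and PowerShell do. The helper name is hypothetical, and the encodings are created without a byte order mark so that only the payload bytes are compared.

```csharp
using System;
using System.IO;
using System.Text;

static class PipeEncodingDemo
{
    // Writes text with one encoding and reads it back with another,
    // using a MemoryStream as a stand-in for a pipe or redirected file.
    public static string SendThroughPipe(string text, Encoding writerEncoding, Encoding readerEncoding)
    {
        using var pipe = new MemoryStream();
        using (var writer = new StreamWriter(pipe, writerEncoding, bufferSize: 1024, leaveOpen: true))
        {
            writer.Write(text);
        }

        pipe.Position = 0;
        using var reader = new StreamReader(pipe, readerEncoding, detectEncodingFromByteOrderMarks: false);
        return reader.ReadToEnd();
    }

    static void Main()
    {
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        var utf16 = new UnicodeEncoding(bigEndian: false, byteOrderMark: false);

        // Both sides agree: the text survives.
        Console.WriteLine(SendThroughPipe("Hi", utf8, utf8)); // Hi

        // Writer uses UTF-16, reader assumes UTF-8: even ASCII text
        // comes back with a NUL character after every letter.
        string mismatched = SendThroughPipe("Hi", utf16, utf8);
        Console.WriteLine(mismatched.Replace("\0", "\\0")); // H\0i\0
    }
}
```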

The current behavior is far from ideal. Having watched PowerShell garble things, I now understand how the encoding issues happen.
We should consider changing the default to UTF-8.

@huoyaoyuan
Member Author

Today I read on OldNewThing that the default encoding can be set to UTF-8 through the manifest. Although we don't own the manifest of any binaries, we can consider setting this property in the default template. Setting it on the default dotnet.exe could be breaking, though.

@tarekgh
Member

tarekgh commented Jun 3, 2022

> Is this a mega breaking change?

Yes, it is a big breaking change. Windows hasn't made this option the default and has kept it marked as Beta for a while now. It is not something we should risk enabling by default.

@jeffhandley jeffhandley added this to the Future milestone Aug 2, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 2, 2022
@ghost

ghost commented Aug 2, 2022

Tagging subscribers to this area: @dotnet/area-system-console
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: huoyaoyuan
Assignees: -
Labels:

area-System.Console

Milestone: Future

@adamsitnik
Member

Closing as a duplicate of #31466.

@adamsitnik adamsitnik closed this as not planned Nov 14, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2023