Console UTF-8 input is misbehaving on Windows #43295

Closed
Tracked by #64487
alexrp opened this issue Oct 12, 2020 · 23 comments
Labels
area-System.Console bug os-windows tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly

@alexrp
Contributor

alexrp commented Oct 12, 2020

Description

using System;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            Console.WriteLine(Console.InputEncoding);
            Console.WriteLine(Console.OutputEncoding);
            Console.WriteLine();
            Console.Write("Text   : ");
            Console.WriteLine("Result : {0}", Console.ReadLine());
        }
    }
}

Example run:

$ dotnet run
System.Text.UTF8Encoding+UTF8EncodingSealed
System.Text.UTF8Encoding+UTF8EncodingSealed

Text   : abcæøådef
Result : abc   def

The non-ASCII characters are basically being replaced with NULs for some reason.

This happens in all terminals I've tried (CMD, PowerShell, Windows Terminal, mintty). I checked chcp in all conhost-based terminals, and it reported 65001 (UTF-8) everywhere. I've also enabled global UTF-8 in Windows region settings just for good measure (enabling/disabling it appears to make no difference).

What is fascinating here is that this only seems to happen in .NET Core processes. No other programs in any of the terminals I've tried have issues processing non-ASCII characters. For example, things like this work in all of them:

$ echo abcæøådef
abcæøådef
$ cat
abcæøådef
abcæøådef

What is even more fascinating is that if you P/Invoke ReadFile to read from standard input in the .NET Core program instead of using System.Console, you get the same issue: The read is successful but non-ASCII characters are just replaced with NULs.

So the question is: Why are .NET Core processes special? What does .NET Core do that seemingly makes ReadFile misbehave?
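For anyone who wants to reproduce the ReadFile observation, here is a minimal sketch. It assumes Windows and .NET 5 or later; the class name RawStdinRead and the helper ToHex are illustrative and not part of the original report. It reads standard input through a P/Invoke of ReadFile and dumps the raw bytes, so any NUL substitution shows up as 00.

```csharp
using System;
using System.Runtime.InteropServices;

static class RawStdinRead
{
    private const int STD_INPUT_HANDLE = -10;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr GetStdHandle(int nStdHandle);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool ReadFile(
        IntPtr hFile, byte[] lpBuffer, uint nNumberOfBytesToRead,
        out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

    // Formats bytes as space-separated uppercase hex, e.g. "61 62 63".
    public static string ToHex(ReadOnlySpan<byte> bytes)
    {
        var parts = new string[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            parts[i] = bytes[i].ToString("X2");
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        if (!OperatingSystem.IsWindows())
        {
            Console.WriteLine("This sketch only applies to Windows.");
            return;
        }

        IntPtr stdin = GetStdHandle(STD_INPUT_HANDLE);
        byte[] buffer = new byte[256];

        Console.Write("Text : ");
        if (!ReadFile(stdin, buffer, (uint)buffer.Length, out uint read, IntPtr.Zero))
        {
            Console.WriteLine($"ReadFile failed, error {Marshal.GetLastWin32Error()}");
            return;
        }

        // On an affected console, non-ASCII input is expected to appear as 00 here.
        Console.WriteLine("Bytes: " + ToHex(buffer.AsSpan(0, (int)read)));
    }
}
```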

Configuration

$ dotnet --info
.NET SDK (reflecting any global.json):
 Version:   5.0.100-rc.1.20452.10
 Commit:    473d1b592e

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19042
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\5.0.100-rc.1.20452.10\

Host (useful for support):
  Version: 5.0.0-rc.1.20451.14
  Commit:  38017c3935

Dotnet-GitSync-Bot added the area-System.Console and untriaged labels on Oct 12, 2020
@ghost

ghost commented Oct 12, 2020

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@huoyaoyuan
Member

Can also reproduce with Chinese characters, but they are replaced with 0 instead of NUL.

@mayorovp
Contributor

Can also reproduce with .NET Framework 4.7.2

@stephentoub
Member

FWIW, I don't see this on my machine. I get this:

C:\Users\stoub\Desktop\tmp>dotnet run
System.Text.OSEncoding
System.Text.OSEncoding

Text   : abcæøådef
Result : abcæoådef

with:

.NET SDK (reflecting any global.json):
 Version:   5.0.100-rc.2.20480.7
 Commit:    53e0c8c7f9

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19042
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\5.0.100-rc.2.20480.7\

I'd guess we have different code pages set globally by default somewhere, as Console uses the win32 GetConsoleCP/GetConsoleOutputCP functions on Windows to determine what encoding to use.
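To compare machines, it may help to print what the console itself reports next to what System.Console picked up. The following is a sketch, assuming Windows for the two Win32 calls and .NET 5 or later; the class name ConsoleCodePages and the helper Describe are illustrative.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class ConsoleCodePages
{
    // Both functions return 0 on failure, for example when no console is attached.
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern uint GetConsoleCP();

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern uint GetConsoleOutputCP();

    // Renders an encoding as "<code page> (<web name>)".
    public static string Describe(Encoding encoding)
    {
        return $"{encoding.CodePage} ({encoding.WebName})";
    }

    static void Main()
    {
        if (OperatingSystem.IsWindows())
        {
            Console.WriteLine($"GetConsoleCP          : {GetConsoleCP()}");
            Console.WriteLine($"GetConsoleOutputCP    : {GetConsoleOutputCP()}");
        }

        Console.WriteLine($"Console.InputEncoding : {Describe(Console.InputEncoding)}");
        Console.WriteLine($"Console.OutputEncoding: {Describe(Console.OutputEncoding)}");
    }
}
```

If the two machines print different code pages, that would be consistent with the different InputEncoding types seen in the two runs.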

cc: @tarekgh, @krwq, @safern

@huoyaoyuan
Member

I checked chcp in all conhost-based terminals, and it reported 65001 (UTF-8) everywhere.

This should be the key point. Check "use UTF-8 for non-Unicode programs".

@huoyaoyuan
Member

image
(Opened a new sandbox to get the screenshot in English)

@huoyaoyuan
Member

image
This happens on the input side. Hard-coded strings are output correctly.
I have seen this issue elsewhere in Visual Studio, but didn't note which parts are affected.

@alexrp
Contributor Author

alexrp commented Oct 12, 2020

@huoyaoyuan I mentioned in the bug description that I have tried both with that setting off and on; it doesn't seem to make any difference. I think whatever you set with chcp will be what .NET cares about at the end of the day, in any case.

@stephentoub for context, my region settings are:

image

@tarekgh
Member

tarekgh commented Oct 12, 2020

@alexrp could you please send the content of the registry key OEMCP under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage?

Also, after changing the Region Settings option to enable UTF-8, did you reboot the machine after that?

Another thing, what font you are using in the console too.

@huoyaoyuan
Member

the content of the registry key

For me, it's 65001

did you reboot the machine after that

I have had this turned on for a year

what font you are using in the console too

Just Consolas. It should be unrelated.

@alexrp
Contributor Author

alexrp commented Oct 13, 2020

@alexrp could you please send the content of the registry key OEMCP under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage?

65001

Also, after changing the Region Settings option to enable UTF-8, did you reboot the machine after that?

Yes.

Another thing, what font you are using in the console too.

I don't think it matters since the text gets garbled at input, not output, but: CMD and PowerShell use Lucida Console, Windows Terminal and mintty use Consolas.

@krwq
Member

krwq commented Oct 13, 2020

It seems I can repro this, but only when I change the console code page to 65001. Note that the output for 437 is also not exactly the same as the input (ø => o).

> dotnet run
System.Text.OSEncoding
System.Text.OSEncoding

Text   : abcæøådef
Result : abcæoådef

> chcp
Active code page: 437

> chcp 65001
Active code page: 65001

> dotnet run
System.Text.UTF8Encoding+UTF8EncodingSealed
System.Text.UTF8Encoding+UTF8EncodingSealed

Text   : abcæøådef
Result : abc   def

@mayorovp
Contributor

Output for 437 is expected. Input redirection works as expected too:

>chcp 65001
Active code page: 65001

>echo abcæøådef | dotnet run
System.Text.UTF8Encoding+UTF8EncodingSealed
System.Text.UTF8Encoding+UTF8EncodingSealed

Text   : Result : abcæøådef

So only direct read from console is broken.
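Based only on the reports in this thread, the failing combination appears to be an interactive (non-redirected) console read while the input code page is 65001. A small sketch that flags that combination at startup could look like this; the class name InputPathCheck and the helper IsAffectedConfiguration are hypothetical, and the rule it encodes is an assumption drawn from the runs above rather than a documented guarantee.

```csharp
using System;

static class InputPathCheck
{
    private const int Utf8CodePage = 65001;

    // Assumption from this thread: redirected input decodes correctly,
    // while direct console reads under code page 65001 lose non-ASCII characters.
    public static bool IsAffectedConfiguration(bool inputRedirected, int inputCodePage)
    {
        return !inputRedirected && inputCodePage == Utf8CodePage;
    }

    static void Main()
    {
        bool redirected = Console.IsInputRedirected;
        int codePage = Console.InputEncoding.CodePage;

        Console.WriteLine($"Input redirected: {redirected}");
        Console.WriteLine($"Input code page : {codePage}");

        if (IsAffectedConfiguration(redirected, codePage))
        {
            Console.WriteLine("Warning: non-ASCII console input may be replaced with NULs.");
        }
    }
}
```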

@krwq
Member

krwq commented Oct 13, 2020

This might be interesting

Code
using System;
using System.IO;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void PrintHex(Span<byte> bytes)
        {
            foreach (byte x in bytes)
            {
                Console.Write($"{x:X2} ");
            }

            Console.WriteLine();
        }

        static void Main()
        {
            string problematic = @"abcæøådef";
            Console.WriteLine(Console.InputEncoding);
            Console.WriteLine(Console.OutputEncoding);
            Console.WriteLine();
            Console.WriteLine($"original: {problematic}");
            Console.Write(" Text   : ");
            Console.WriteLine(" Result : {0}", Console.ReadLine());
            Stream stdin = Console.OpenStandardInput();

            Console.Write(" Text   : ");
            byte[] bytes = new byte[100];
            int readBytes = stdin.Read(bytes);
            Span<byte> input = new Span<byte>(bytes).Slice(0, readBytes);

            Console.Write("   input: ");
            PrintHex(input);
            Console.Write("in. conv: ");
            Console.Write(Console.InputEncoding.GetString(input));

            Console.Write("original: ");
            PrintHex(Console.InputEncoding.GetBytes(problematic));

        }
    }
}

Output:

System.Text.UTF8Encoding+UTF8EncodingSealed
System.Text.UTF8Encoding+UTF8EncodingSealed

original: abcæøådef
 Text   : abcæøådef
 Result : abc   def
 Text   : abcæøådef
   input: 61 62 63 00 00 00 64 65 66 0D 0A
in. conv: abc   def
original: 61 62 63 C3 A6 C3 B8 C3 A5 64 65 66
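The dump above shows one 00 byte per typed non-ASCII character, not one per UTF-8 byte: three two-byte sequences became three single NULs. The following sketch makes that comparison explicit using the byte values from the output above. It does not read the console; the observed bytes are hard-coded from the dump, and the class name NulSubstitutionCheck and its helpers are illustrative.

```csharp
using System;
using System.Text;

static class NulSubstitutionCheck
{
    // Formats bytes as space-separated uppercase hex, e.g. "61 62 63".
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace('-', ' ');
    }

    // Counts bytes equal to 0x00.
    public static int CountNuls(byte[] bytes)
    {
        int count = 0;
        foreach (byte b in bytes)
        {
            if (b == 0) count++;
        }
        return count;
    }

    // Counts UTF-16 code units above the ASCII range.
    public static int CountNonAscii(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (c > 0x7F) count++;
        }
        return count;
    }

    static void Main()
    {
        // "abcæøådef", written with escapes so the source file encoding does not matter.
        string typed = "abc\u00E6\u00F8\u00E5def";
        byte[] expected = Encoding.UTF8.GetBytes(typed);

        // Bytes copied from the "input:" line of the dump above (including CR LF).
        byte[] observed = { 0x61, 0x62, 0x63, 0x00, 0x00, 0x00, 0x64, 0x65, 0x66, 0x0D, 0x0A };

        Console.WriteLine("expected: " + ToHex(expected));
        Console.WriteLine("observed: " + ToHex(observed));
        Console.WriteLine($"non-ASCII chars typed: {CountNonAscii(typed)}");
        Console.WriteLine($"NUL bytes observed   : {CountNuls(observed)}");
    }
}
```

Both counts come out as 3 for this input, which suggests the substitution happens per character before the bytes reach the process, rather than per encoded byte.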

@krwq
Member

krwq commented Oct 13, 2020

I suspect the problem might be us using ReadFile (useFileAPIs under debugger is true): https://github.com/dotnet/runtime/blob/master/src/libraries/System.Console/src/System/ConsolePal.Windows.cs#L1167

I see people report issues with that:
https://stackoverflow.com/questions/48176431/reading-utf-8-characters-from-console
microsoft/terminal#4551 (comment)

I think we should always go to the other code path and possibly do some conversion there (or hard-code the input encoding to whatever ReadConsole is using on Windows)
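As a rough illustration of the "other code path", the sketch below reads a line with the wide-character ReadConsoleW API, which returns UTF-16 regardless of the console code page. This is not the runtime's implementation; the class name WideConsoleRead, the helpers, and the fixed 1024-character buffer are assumptions made for the example. ReadConsoleW fails when standard input is redirected, so the sketch is only meant for interactive use on Windows.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

static class WideConsoleRead
{
    private const int STD_INPUT_HANDLE = -10;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr GetStdHandle(int nStdHandle);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern bool ReadConsoleW(
        IntPtr hConsoleInput, [Out] char[] lpBuffer, uint nNumberOfCharsToRead,
        out uint lpNumberOfCharsRead, IntPtr pInputControl);

    // Removes the trailing CR/LF that a line-mode console read includes.
    public static string TrimLineEnding(string text)
    {
        return text.TrimEnd('\r', '\n');
    }

    // Reads one line as UTF-16. Input longer than the buffer is cut off (sketch only).
    public static string ReadLineWide()
    {
        IntPtr stdin = GetStdHandle(STD_INPUT_HANDLE);
        char[] buffer = new char[1024];

        if (!ReadConsoleW(stdin, buffer, (uint)buffer.Length, out uint read, IntPtr.Zero))
        {
            // Typically fails when stdin is not a console (for example, redirected input).
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        return TrimLineEnding(new string(buffer, 0, (int)read));
    }

    static void Main()
    {
        if (!OperatingSystem.IsWindows() || Console.IsInputRedirected)
        {
            Console.WriteLine("This sketch needs an interactive Windows console.");
            return;
        }

        Console.Write("Text   : ");
        Console.WriteLine("Result : {0}", ReadLineWide());
    }
}
```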

krwq added the bug label and removed the untriaged label on Oct 13, 2020
krwq added this to the 6.0.0 milestone on Oct 13, 2020
@krwq
Member

krwq commented Oct 13, 2020

@danmosemsft would this meet the bar for 5.0/servicing? This impacts every customer whose .NET console apps read non-ASCII characters. I know this repros in 3.1 at minimum, and likely in earlier versions as well.

@jeffhandley
Member

@krwq I recommend we get the fix into 6.0.0. After that, if we receive enough reports of users being blocked by this bug, we'd consider down-level servicing. We would need validation from users who have encountered this that the behavior is indeed fixed with the 6.0.0 builds.

@ufcpp
Contributor

ufcpp commented Oct 24, 2020

Another result in RC 2 with CP 932...

image
image

@giovinazzo-kevin

Can reproduce on .NET 6

@Gnbrkm41
Contributor

This also reproduces in .NET 7 p5:

❯ dotnet --info
.NET SDK:
 Version:   7.0.100-preview.5.22307.18
 Commit:    bd8b088037

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.25151
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\7.0.100-preview.5.22307.18\

global.json file:
  Not found

Host:
  Version:      7.0.0-preview.5.22301.12
  Architecture: x64
  Commit:       425fedc0fb

.NET SDKs installed:
  6.0.300 [C:\Program Files\dotnet\sdk]
  7.0.100-preview.5.22307.18 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.0-preview.5.22303.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.22 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.0-preview.5.22301.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.22 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.0-preview.5.22302.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Download .NET:
  https://aka.ms/dotnet-download

Learn about .NET Runtimes and SDKs:
  https://aka.ms/dotnet/runtimes-sdk-info

❯ Get-WinSystemLocale

LCID             Name             DisplayName
----             ----             -----------
1042             ko-KR            한국어(대한민국)

Console.InputEncoding = Encoding.UTF8;
Console.WriteLine($"Current Encoding: {Console.InputEncoding}");
Console.WriteLine($"Input: {Console.ReadLine()}");

The code above, when '가나다' is entered, displays nothing: the string returned from Console.ReadLine appears to be three NUL characters instead of the actual characters that were typed. It seems to work fine with Encoding.Unicode though.
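A sketch of the Encoding.Unicode variant mentioned in this comment is below. It only restates the observation as code and is not a verified fix; the class name UnicodeInputWorkaround and the helper ContainsNul are illustrative. The NUL check is there so the result can be judged even when the console font or output code page cannot display the characters.

```csharp
using System;
using System.Text;

static class UnicodeInputWorkaround
{
    // Returns true when the text contains at least one NUL character.
    public static bool ContainsNul(string text)
    {
        return text.IndexOf('\0') >= 0;
    }

    static void Main()
    {
        if (OperatingSystem.IsWindows())
        {
            // Per the report in this thread, UTF-16 input avoids the NUL substitution.
            Console.InputEncoding = Encoding.Unicode;
        }

        Console.WriteLine($"Current Encoding: {Console.InputEncoding}");
        Console.Write("Input: ");

        string line = Console.ReadLine() ?? string.Empty;
        Console.WriteLine($"Read {line.Length} chars, contains NUL: {ContainsNul(line)}");
        Console.WriteLine($"Echo: {line}");
    }
}
```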

jeffhandley modified the milestones: 7.0.0, Future on Jul 9, 2022
hsheric0210 added a commit to hsheric0210/SecureLookup that referenced this issue Jan 27, 2023
[+] Added NameRegen command: Now you can rename archive files without editing all entries manually
[+] Dictionary file support: To bypass dotnet/runtime#43295
@o-sdn-o

o-sdn-o commented May 20, 2023

Looks like it's fixed by microsoft/terminal#14745.

@adamsitnik
Member

Looks like it's fixed by microsoft/terminal#14745.

@o-sdn-o thank you for letting us know!

In that case, I am closing the issue.

adamsitnik added the tracking-external-issue label on May 22, 2023
ghost locked as resolved and limited conversation to collaborators on Jun 21, 2023