Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Globalization: ToolTask batch file encoding does not respect UTF-8 encoding #5724

Closed
jgoshi opened this issue Sep 9, 2020 · 16 comments · Fixed by #6188
Closed

Globalization: ToolTask batch file encoding does not respect UTF-8 encoding #5724

jgoshi opened this issue Sep 9, 2020 · 16 comments · Fixed by #6188

Comments

@jgoshi
Copy link

jgoshi commented Sep 9, 2020

Issue Description

ToolTask supports setting the StandardOutputEncoding but does not work properly for UTF-8 encoding.

Steps to Reproduce

  1. Japanese OS
  2. Use the attached project
    BatchEncodingBug.zip
  3. From a Developer Command Prompt run the following.
    a. chcp 65001
    b. MSBuild.exe BatchEncodingBug.vcxproj /t:TestExec

Expected Behavior

The output test should correctly display Japanese characters.

Actual Behavior

Some of the text is garbled.

Analysis

If you look at the provided vcxproj file, we use the task and set the output encoding to UTF-8. Because (in the repro steps) we ran "chcp 65001" we should be using UTF-8. When when msbuild runs the Exec, the batch file it generates is saved in ANSI instead of UTF-8. After running msbuild you'll see "file.cmd" which is a copy of the msbuild generated batch file and can verify that it is in ANSI.

This is coming from this file:
https://github.com/dotnet/msbuild/blob/f2c4bfd563f559daca27ea1cd8ae40db24e2e7cd/src/Utilities/ToolTask.cs

Look for EncodingUtilities.BatchFileEncoding. Ideally msbuild should respect the console code page.

@jgoshi jgoshi added bug needs-triage Have yet to determine what bucket this goes in. labels Sep 9, 2020
@benvillalobos
Copy link
Member

This looks similar to #4870

@tristanlabelle
Copy link

tristanlabelle commented Sep 9, 2020

A few extra details:

The reason why Exec/ToolTask saves the batch file using the OEM encoding is that it detects that all characters can be encoded using this encoding, since there is an echo with japanese characters and the OEM encoding is the japanese codepage. The problem is that cmd.exe does not use the OEM encoding for reading batch files but rather the console (input) code page (GetConsoleCP/Console.InputEncoding), which normally defaults to OEM but in this case was changed to utf-8 by virtue of using chcp 65001.

The fix may be to have EncodingUtilities.BatchFileEncoding use Console.InputEncoding where it currently uses SystemOEMEncoding.

I also want to point out that this is a world readiness issue since the message will be printed correctly or not depending on the OEM encoding (hence OS language), and it surfaces in the use of <CustomBuild><Message> in C++ projects.

@rainersigwald
Copy link
Member

We can't use Console.InputEncoding--that applies only to a single process, and MSBuild has long-lived background processes that don't have a console (and may have been started under a different codepage).

@tristanlabelle
Copy link

Then you can P/Invoke for GetConsoleCP if that's any different. The idea is that at the time where cmd.exe is about to be started, the current process' console input code page is what cmd.exe will inherit and use to read the .cmd file, not the OEM codepage.

@benvillalobos benvillalobos removed the needs-triage Have yet to determine what bucket this goes in. label Sep 16, 2020
@Forgind
Copy link
Member

Forgind commented Jan 11, 2021

This came very late, but I could break this repro by switching StdOutEncoding="UTF-8" to UseUtf8Encoding="ALWAYS". Is that viable for https://dev.azure.com/devdiv/DevDiv/_workitems/edit/1178352?

@Forgind
Copy link
Member

Forgind commented Jan 11, 2021

image

@marcpopMSFT
Copy link
Member

@jgoshi checking if you've had a chance to review Forgind's comment

@jgoshi
Copy link
Author

jgoshi commented Feb 1, 2021

@marcpopMSFT Do you mean if I've had a chance to try out to see if this bug is fixed:
https://dev.azure.com/devdiv/DevDiv/_workitems/edit/1178352

If I change all appearances of StdOutEncoding="UTF-8" to UseUtf8Encoding="ALWAYS" in my projects?

I haven't had a chance to try that out yet. I'll try to get access to a Japanese OS machine.

@MrTrillian do you see any issues with this approach (if it works)? It would mean we'd need to change the Kitware CMake code to emit UseUtf8Encoding="ALWAYS" instead of what they currently do (emit StdOutEncoding="UTF-8").

@tristanlabelle
Copy link

tristanlabelle commented Feb 2, 2021

I've just spent an hour trying to page back in the subtleties of this. I think the answer is that in the original bug, we're not using <Exec> directly but rather our <CustomBuild> which inherits from ToolTask, which does not have a UseUtf8Encoding property, see:

encoding = EncodingUtilities.BatchFileEncoding(commandLineCommands + _temporaryBatchFile, EncodingUtilities.UseUtf8Detect);

There might be other reasons why it would be problematic even if we had UseUtf8Encoding, but it's so hard to reason about these issues... cmd.exe is such a major source of internationalization problems. :(

@Forgind
Copy link
Member

Forgind commented Feb 2, 2021

I can make something to add a UseUtf8Encoding property to ToolTask, have it default to detect, and let you overwrite it to always, then use it on the line you linked. Then we can try to test it. If that doesn't work for some reason, I'd say it's time to formally get a permanent exception. I think we've both forgotten about this bug and spent time trying to remember details more than once at this point. How does that sound?

@Forgind
Copy link
Member

Forgind commented Feb 4, 2021

https://github.com/Forgind/msbuild/tree/UseUtf8Encoding has the change to permit specifying UseUtf8Encoding in a ToolTask.

@tristanlabelle
Copy link

tristanlabelle commented Feb 8, 2021

If I recall, the problem is this:

If cmake migrates to generating VS solutions that set UseUtf8Encoding="ALWAYS", then using cmake directly (outside of VS, in a cmd.exe), will have the side-effect of changing the console code page because msbuild will share the console with cmake and will inject a chcp 65001. It would be badly behaved for a command-line program to change its parent console's code page since that could have all kinds of other side-effects. For example, if cmake was being called in a batch file, it would change how the remaining bytes of that batch file get decoded. Inside VS, we don't have that problem because we are the root program and any consoles will be created only for child processes like when we launch msbuild.

@Forgind
Copy link
Member

Forgind commented Feb 8, 2021

Could you save it off, call MSBuild, and restore it? I remember trying to do something like that in MSBuild, and it hit a snag because there wasn't anything to attach a code page to, but presumably you could do it?

Also, I'm a little surprised, since I thought we spun up a new process for executing the code page work, hence that you changing the code page to 65001 didn't affect whether MSBuild used utf-8 or not. I may be misremembering that part.

@tristanlabelle
Copy link

tristanlabelle commented Feb 9, 2021

In this scenario, there is no Visual Studio involved to save off and restore the chcp. Anyone can use cmake.exe directly from the command line.

New processes don't guarantee isolation from code page changes, it depends on how the new processes are created. By default, they will share the console with their parent, and the console is what holds the code page state. msbuild probably needs child processes to share the console with their parent so they can output to it.

@Forgind
Copy link
Member

Forgind commented Feb 11, 2021

I wasn't necessarily thinking of saving it in association with anything in particular. If it were C#, I'd just assign a random variable to the code page and call chcp at the end. From the command line, maybe an environment variable? I'm assuming you wouldn't have to do this for multiple different code pages in a single build.

@tristanlabelle
Copy link

We can't implement this because anyone (say, my uncle) could call cmake.exe directly in a cmd.exe window. In that case, no code from Visual Studio is launching cmake.exe and can sandwich it between chcp calls. It's like if you open notepad on your computer, Visual Studio doesn't have a hook to run code before and after (hopefully!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants