Windows: Don't error on broken non UTF-8 output #134534

ChrisDenton · 2024-12-19T19:29:11Z

Currently, on Windows, the standard library will error if you try to write invalid UTF-8, whereas other platforms allow this. The issue arises because the console uses UTF-16, so Rust has to re-encode the output.

This PR fixes it in two ways:

If the console's code page is set to UTF-8 then we can just write bytes to the handle normally.
Otherwise we do a lossy conversion to UTF-16.
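As a rough illustration of the two paths above, here is a portable sketch. It is not the code from this PR: `ConsoleOutput` and `prepare_output` are hypothetical names, the code page is passed in as a plain number rather than queried from the console, and `String::from_utf8_lossy` stands in for whatever conversion the real implementation uses.

```rust
/// Windows code page identifier for UTF-8.
const CP_UTF8: u32 = 65001;

/// Hypothetical type describing what would be handed to the console.
#[derive(Debug, PartialEq)]
enum ConsoleOutput {
    /// Raw bytes, written as-is (UTF-8 code page).
    Bytes(Vec<u8>),
    /// UTF-16 code units produced by a lossy conversion (any other code page).
    Wide(Vec<u16>),
}

/// Hypothetical helper: pick the write strategy from the console code page.
/// Real code would query the code page from the OS instead of taking it
/// as an argument (assumption for the sake of a portable example).
fn prepare_output(code_page: u32, data: &[u8]) -> ConsoleOutput {
    if code_page == CP_UTF8 {
        // The console interprets the bytes itself, so invalid UTF-8 is
        // passed through unchanged.
        ConsoleOutput::Bytes(data.to_vec())
    } else {
        // Invalid sequences become U+FFFD instead of causing an error.
        let text = String::from_utf8_lossy(data);
        ConsoleOutput::Wide(text.encode_utf16().collect())
    }
}
```

With this sketch, `prepare_output(437, b"a\xFF")` yields the UTF-16 units for `a` followed by U+FFFD, while the same input under code page 65001 is passed through byte for byte.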

Fixes #116871

Also update the tests to avoid testing implementation details.

tbu- · 2024-12-21T10:37:44Z

library/std/src/sys/pal/windows/stdio.rs

-pub struct Stdout {
-    incomplete_utf8: IncompleteUtf8,
-}
+pub struct Stdout {}


The reason for the removal of incomplete UTF-8 handling at the end of the string is not clear to me from the commit description. Why was that removed?

Because it now simply truncates the write to remove incomplete UTF-8 from the end and instead leaves the buffering to buffer types, i.e. LineWriter in this case.

A LineWriter will flush an incomplete line if its buffer capacity is exceeded. If that happens, the output must support partial UTF-8 writes, or non-ASCII characters might get lost or replaced with the replacement character.

That can only result in broken UTF-8 if the user writes incomplete UTF-8 to LineWriter themselves.

I see, digging through the source code, BufWriter makes sure to not split writes that the user issued.
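The buffering behavior discussed here can be observed with a small experiment. The sketch below is only an illustration: `Recorder` and `record_chunks` are hypothetical names, and the recorder always accepts the full buffer, so it says nothing about sinks that perform short writes.

```rust
use std::cell::RefCell;
use std::io::{self, LineWriter, Write};
use std::rc::Rc;

/// Hypothetical sink that records every buffer it is asked to write.
struct Recorder(Rc<RefCell<Vec<Vec<u8>>>>);

impl Write for Recorder {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.borrow_mut().push(buf.to_vec());
        // Always accept everything (no short writes in this sketch).
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write each piece through a `LineWriter` with the given capacity and
/// return the chunks that reached the underlying sink.
fn record_chunks(capacity: usize, pieces: &[&str]) -> Vec<Vec<u8>> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut writer = LineWriter::with_capacity(capacity, Recorder(Rc::clone(&log)));
    for piece in pieces {
        writer.write_all(piece.as_bytes()).unwrap();
    }
    writer.flush().unwrap();
    drop(writer);
    let chunks = log.borrow().clone();
    chunks
}
```

Calling `record_chunks(8, &["héllo", " wörld", "!"])` forces a flush before any newline is seen, yet every recorded chunk should still be valid UTF-8, because the flush happens between the caller's writes rather than in the middle of one.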

What is the motivation for truncating invalid UTF-8 at the end of the string?

All else being equal, I'd rather expect the previous behavior, that I can construct UTF-8 output byte-by-byte.

Rather than having a secret stack buffer that can't be inspected or flushed, I'd strongly prefer buffering be done at a higher level. It's also a lot of added complexity for an edge case where the better solution is to set the console code page.

In any case, if that behavior is wanted, it should probably be documented in the commit message so that it is clear to future readers that this change was on purpose.

Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.

That's what happens in this code. No bytes are ever lost. Either the caller is informed that fewer bytes were written than were provided or, if there is only an incomplete code point, then that is written to the console (where it will be converted to replacement characters by the lossy translation to UTF-16).
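A minimal sketch of that rule, under stated assumptions: `incomplete_tail_len` and `writable_len` are hypothetical helpers, not functions from the PR, and the tail check only looks at lead bytes and sequence lengths rather than fully validating the input.

```rust
/// Length of an incomplete UTF-8 sequence at the end of `buf`, or 0 if the
/// buffer does not end in the middle of a code point.
fn incomplete_tail_len(buf: &[u8]) -> usize {
    // A code point is at most 4 bytes, so an incomplete one is at most 3.
    for back in 1..=buf.len().min(3) {
        let byte = buf[buf.len() - back];
        if (byte & 0b1100_0000) == 0b1000_0000 {
            // Continuation byte: keep looking for the lead byte.
            continue;
        }
        let needed = match byte {
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            // ASCII or an invalid lead byte: nothing is "incomplete" here.
            _ => return 0,
        };
        return if needed > back { back } else { 0 };
    }
    0
}

/// Number of bytes a single write would report as written.
fn writable_len(buf: &[u8]) -> usize {
    let tail = incomplete_tail_len(buf);
    if tail == buf.len() {
        // Only an incomplete code point: write it anyway, so the caller
        // always makes progress (it would show up as replacement characters).
        buf.len()
    } else {
        // Hold back the incomplete tail; the short count tells the caller
        // to retry those bytes.
        buf.len() - tail
    }
}
```

For example, `writable_len(b"ok\xF0\x9F\x98")` is 2, so a caller using `write_all` would retry the three trailing bytes, while `writable_len(b"\xF0\x9F\x98")` is 3 because there is nothing else to write.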

Ah yes, that makes sense. I didn't realize the caller would be informed by the return value of write.

What is the motivation for special casing trailing invalid UTF-8? It seems to increase the code complexity a little as well, and is not necessary for std's own use cases.

Is it for supporting a potential non-std buffered writer?

Sure, it could be removed. Stderr is not buffered by us though and there have been proposals for unbuffered stdout.

tbu- · 2024-12-21T12:14:41Z

library/std/src/sys/pal/windows/stdio.rs

-        assert!(result != 0, "Unexpected error in MultiByteToWideChar");
-
+        // The only way an error can happen here is if we've messed up.
+        debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar");


Suggested change

-        debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar");
+        assert!(result != 0, "Unexpected error in MultiByteToWideChar");

I think this should be an assert since this isn't performance critical — we've just done a syscall.
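For readers less familiar with the two macros, the practical difference is that `debug_assert!` is only checked when debug assertions are enabled, whereas `assert!` is checked in every build. The sketch below uses hypothetical helper names and a plain integer in place of the real return value.

```rust
/// Hypothetical wrapper around a conversion result; `result` stands in for
/// the value returned by the OS call.
fn checked_len(result: i32) -> usize {
    // Checked in every build profile, including optimized ones.
    assert!(result != 0, "Unexpected error in MultiByteToWideChar");
    result as usize
}

/// Same check with `debug_assert!`: it is skipped entirely when debug
/// assertions are disabled, which is the default for release builds.
fn debug_checked_len(result: i32) -> usize {
    debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar");
    result as usize
}

/// Reports whether `debug_assert!` checks are active in this build.
fn debug_checks_enabled() -> bool {
    cfg!(debug_assertions)
}
```

In a build without debug assertions, `debug_checked_len(0)` would silently return 0, while `checked_len(0)` would panic with the message above.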

ChrisDenton · 2024-12-21T15:15:27Z

I'm going to split off the LineWriter and GetConsoleCP changes into their own PRs as they are unrelated to the more controversial changes. Marking this as draft in the meantime.

bors · 2024-12-27T16:36:19Z

☔ The latest upstream changes (presumably #134822) made this pull request unmergeable. Please resolve the merge conflicts.

ChrisDenton added 2 commits December 19, 2024 19:01

Windows: Use WriteFile to write to a UTF-8 console

85eabcd

Don't error on incomplete UTF-8

bd64bcb

rustbot assigned workingjubilee Dec 19, 2024

rustbot added O-windows, S-waiting-on-review, and T-libs labels Dec 19, 2024

Avoid short writes in LineWriter

317d00a

Also update the tests to avoid testing implementation details.

ChrisDenton force-pushed the cp-utf8 branch from 2e5c75c to 317d00a Compare December 20, 2024 13:41

tbu- reviewed Dec 21, 2024

ChrisDenton marked this pull request as draft December 21, 2024 15:15
