Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: Don't error on broken non UTF-8 output #134534

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ChrisDenton
Copy link
Member

@ChrisDenton ChrisDenton commented Dec 19, 2024

Currently, on Windows, the standard library will error if you try to write invalid UTF-8. Whereas other platforms allow this. The issue arises because the console uses UTF-16 so Rust has to re-encode the output.

This PR fixes it in two ways:

  • If the console's code page is set to UTF-8 then we can just write bytes to the handle normally.
  • Otherwise we do a lossy conversion to UTF-16.

Fixes #116871

@rustbot rustbot added O-windows Operating system: Windows S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 19, 2024
@rust-log-analyzer

This comment has been minimized.

Also update the tests to avoid testing implementation details.
pub struct Stdout {
incomplete_utf8: IncompleteUtf8,
}
pub struct Stdout {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the removal of incomplete UTF-8 handling at the end of the string is not clear to me from the commit description. Why was that removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it now simply truncates the write to remove incomplete UTF-8 from the end and instead leaves the buffering to buffer types, i.e. LineWriter in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A LineWriter will flush an incomplete line if its buffer capacity is exceeded. If that happens, the output must support partial UTF-8 writes, or non-ASCII characters might get lost or replaced with the replacement character.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can only result in broken UTF-8 if the user writes incomplete UTF-8 to LineWriter themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, digging through the source code, BufWriter makes sure to not split writes that the user issued.

What is the motivation for truncating invalid UTF-8 at the end of the string?

All else being equal, I'd rather expect the previous behavior, that I can construct UTF-8 output byte-by-byte.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having a secret stack buffer that can't be inspected or flushed, I'd strongly prefer buffering be done at a higher level. It's also a lot of added complexity for an edge case where the better solution is to set the console code page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, if that behavior is wanted, it should probably be documented in the commit message so that it is clear to future readers that this change was on purpose.

Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.

Copy link
Member Author

@ChrisDenton ChrisDenton Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.

That's what happens in this code. No bytes are ever lost. Either the caller is informed that less bytes were written than were provided or, if there is only an incomplete code point, then that is written to the console (which will be converted to replacement characters when lossy translating to UTF-16).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that makes sense. I didn't realize the caller would be informed by the return value of write.

What is the motivation for special casing trailing invalid UTF-8? It seems to increase the code complexity a little as well, and is not necessary for std's own use cases.

Is it for supporting a potential non-std buffered writer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it could be removed. Stderr is not buffered by us though and there have been proposals for unbuffered stdout.

assert!(result != 0, "Unexpected error in MultiByteToWideChar");

// The only way an error can happen here is if we've messed up.
debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar");
assert!(result != 0, "Unexpected error in MultiByteToWideChar");

I think this should be an assert since this isn't performance critical — we've just done a syscall.

@ChrisDenton
Copy link
Member Author

I'm going to split off the LineWriter and GetConsoleCP changes into their own PRs as they are unrelated to the more controversial changes. Marking this as draft in the meantime.

@ChrisDenton ChrisDenton marked this pull request as draft December 21, 2024 15:15
@bors
Copy link
Contributor

bors commented Dec 27, 2024

☔ The latest upstream changes (presumably #134822) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-windows Operating system: Windows S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading from and writing to the Windows console should be lossy
6 participants