
Format Duration microseconds with "us" suffix, without Unicode #75065

Closed

Conversation

joshtriplett
Member

Formatting a Duration currently uses the suffix "µs" for microseconds.
However, this does not take into account whether the terminal or other
output device can handle Unicode. Make it more robust on legacy
terminals by using "us" instead.

@rust-highfive
Collaborator

r? @LukasKalbertodt

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 2, 2020
@joshtriplett
Member Author

joshtriplett commented Aug 2, 2020

Context: when developing CLI applications for widespread use, I try to avoid assuming the user's terminal is Unicode-capable without checking first. Checking locale for something as simple as a duration suffix doesn't seem worthwhile, when there's a widely understood ASCII suffix that works as an alternative. As far as I know, no other Display or Debug instance in the Rust standard library includes non-ASCII characters; I think this is the only one.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Aug 4, 2020
@scottmcm
Member

scottmcm commented Aug 4, 2020

I'm torn here. What you say isn't unreasonable, but I'm also always opposed to using SI incorrectly, and it's unambiguously the case that "µs" is correct under SI and "us" isn't.

I think I might be more persuaded for Display, but since Debug is developer-focused I think it's fine for them to be expected to configure their system to allow Unicode, since even Windows terminals can do that fine these days.

@clarfonthey
Contributor

clarfonthey commented Aug 5, 2020

What about making the standard output use the proper character and requiring the alternative form to show a u instead?

My opinion is that because formatting is explicitly Unicode, the proper Unicode character should be used by default. But calling it an alternative form is fine.

@joshtriplett
Member Author

Debug renderings may still end up in log files, crash dumps, and other places where "be conservative in what you send" applies.

@clarfonthey
Contributor

IMHO, it's 2020 and programs should have Unicode support. At minimum, the ability to pass along Unicode characters without breaking.

@LukasKalbertodt
Member

I agree with @scottmcm and @clarfon and would prefer not merging this.

In addition to what they said, we make hardly any promises about the Debug output, and thus it shouldn't be used in any situation where the output matters. For log files and crash dumps I can totally understand using Debug, but in those cases I really can't imagine how UTF-8 could cause any problems, especially since those files are basically only viewed by programmers, who could deal with the worst case, namely an incorrectly rendered character.

What about making the standard output use the proper character and requiring the alternative form to show a u instead?

I don't think that's a viable solution, especially for Debug, as its docs say "When used with the alternate format specifier #?, the output is pretty-printed." So doing it the other way around here would be strange. This would also mean people printing large structs with :#? now get the u in nested Durations.
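For illustration, a minimal sketch of the {:?} vs {:#?} distinction on a hypothetical struct (the Timings type here is invented for the example; the suffixes shown are what std's Duration Debug impl produces today):

    use std::time::Duration;

    // Hypothetical struct, purely to show how nested Durations format.
    #[derive(Debug)]
    struct Timings {
        parse: Duration,
        run: Duration,
    }

    fn main() {
        let t = Timings {
            parse: Duration::from_micros(250),
            run: Duration::from_millis(3),
        };
        println!("{:?}", t);  // compact: Timings { parse: 250µs, run: 3ms }
        println!("{:#?}", t); // pretty-printed, one field per line, same suffixes
    }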

@joshtriplett
Member Author

@clarfon

IMHO, it's 2020 and programs should have Unicode support.

No argument there. But many people in 2020 run software that wasn't designed or written in 2020. There's no excuse in 2020 to write software that doesn't handle Unicode. But there's absolutely a reason in 2020 to write software that doesn't assume Unicode: https://en.wikipedia.org/wiki/Robustness_principle

If you're writing software to run on myriad end-user systems, some of which may be running a random xterm or rxvt or serial connection, or for that matter a Linux virtual terminal which does not have full Unicode character support (partly for font reasons), it's reasonable to limit yourself to ASCII by default. This is, as far as I know, the only case in the Rust standard library that emits non-ASCII characters, and it's a cosmetic usage in a Debug implementation.

The alternative here is not "well, guess I'll just force my users to make Unicode work everywhere or deal with breakage". The alternative here is "well, guess I have to hand-write my own Duration formatting, or more likely wrap it in something that replaces the suffix".

@LukasKalbertodt, on Zulip you said "I guess u is a safe default then"; what changed?

@joshtriplett
Member Author

If people really see it as valuable to emit the µ symbol here, then that would require checking whether the user's locale actually supports it; the usual way to do that on UNIX systems is to use the locale functions to find out if the user's LC_CTYPE is a UTF-8 locale. That is, for instance, what GCC does to decide whether to emit "smart quotes" on the terminal. That's not trivial to check everywhere (and there are different ways to do it on other platforms), but that's the "correct" way to decide if it's safe to write arbitrary Unicode characters to the terminal. (Even then, the user might not have a font with the requisite character, such as on the Linux console, but it's a more reasonable assumption.)

I don't think that level of checking is worthwhile for emitting µs instead of us, and I don't plan to implement it. I'd like to see the current behavior fixed; this isn't a random feature request, it's a bug report.
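For context, a rough sketch of the kind of locale check described above; the environment-variable precedence and suffix matching here are simplifications of what a real implementation (like GCC's) would do:

    use std::env;

    // Rough heuristic: does the environment claim a UTF-8 locale?
    // Checks LC_ALL, then LC_CTYPE, then LANG, mirroring POSIX precedence.
    fn locale_claims_utf8() -> bool {
        ["LC_ALL", "LC_CTYPE", "LANG"]
            .into_iter()
            .filter_map(|name| env::var(name).ok())
            .find(|value| !value.is_empty())
            .map(|value| {
                let upper = value.to_ascii_uppercase();
                upper.ends_with("UTF-8") || upper.ends_with("UTF8")
            })
            .unwrap_or(false)
    }

    fn main() {
        let suffix = if locale_claims_utf8() { "µs" } else { "us" };
        println!("123{suffix}");
    }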

@clarfonthey
Contributor

Again, if you're running in a place that doesn't support Unicode, I can't imagine it breaking catastrophically. And this sets a precedent for Rust in general to prefer ASCII despite being Unicode by default, which I don't think is reasonable.

What's to say you wouldn't be forcing someone else to write the same kinds of workarounds you're suggesting to convert the u back into a mu?

What if we want to add other Unicode symbols in debug output for brevity? Should we decline doing so because of ASCII-only terminals? What about strings that contain printable Unicode characters that aren't ASCII?

@Manishearth
Member

I've definitely seen Unix tools break on Unicode because of misconfigured locales. I use an inordinate amount of unicode on the command line and I've experienced issues many times. Especially when there are multiple terminals involved (e.g. running through screen).

This is the Debug output: its main purpose is to go to the console and work, not look nice. I wouldn't be surprised if logging tools broke with misconfigured locales here.

@joshtriplett
Member Author

What if we want to add other Unicode symbols in debug output for brevity? Should we decline doing so because of ASCII-only terminals?

In Debug and Display impls in the standard library, yes.

What about strings that contain printable Unicode characters that aren't ASCII?

We should pass through the user's Unicode data. The standard library just shouldn't add gratuitous Unicode to a user's ASCII string.

@the8472
Member

the8472 commented Aug 7, 2020

I've definitely seen Unix tools break on Unicode because of misconfigured locales. I use an inordinate amount of unicode on the command line and I've experienced issues many times. Especially when there are multiple terminals involved (e.g. running through screen).

How would it break in this particular case though? µ encodes as [194, 181], and both bytes are printable on pretty much any 8-bit codepage. They're just not pure ASCII. Sure, you get mojibake, but that's not outright breakage.
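Concretely, in Rust terms (the byte values are a fact of UTF-8, not an assumption):

    fn main() {
        // U+00B5 MICRO SIGN is two bytes in UTF-8: 0xC2 0xB5 (194, 181 decimal).
        assert_eq!("µ".as_bytes(), &[0xC2, 0xB5][..]);
        // Read as Latin-1 or cp437, those two bytes render as two printable
        // characters (mojibake) rather than failing outright.
        println!("{:?}", "10µs".as_bytes()); // [49, 48, 194, 181, 115]
    }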

@clarfonthey
Contributor

In Debug and Display impls in the standard library, yes.

Feel like there should be some sort of formal decision made on this, rather than just one PR. That way these kinds of discussions only happen once.

@LukasKalbertodt
Member

@LukasKalbertodt, on Zulip you said "I guess u is a safe default then"; what changed?

I simply thought about it a bit more. However, my opinion on this is not particularly strong either way.

This is the Debug output: its main purpose is to go to the console and work, not look nice.

Well, that argument is not universally true, right? The output is still intended to be human-readable and improving readability is useful IMO. There are quite a few manual Debug impls that exist purely to improve readability of the output.

Feel like there should be some sort of formal decision made on this, rather than just one PR. That way these kinds of discussions only happen once.

I agree. Any idea how to best do this? Create an issue and ping all relevant teams? (Probably just libs?) Or just ping them here?

@Manishearth
Member

The output is still intended to be human-readable and improving readability is useful IMO

Things aren't readable if they get mojibake'd or crash the terminal 😄. us is quite common in computer programs; I don't see it as materially less readable than μs. My background is physics and us is used a ton over there as well.

@clarfonthey
Contributor

I personally feel strongly about the use of μ over u mostly because the symbol is genuinely different, and it's inconsequential whether they're lexemic to English-speakers. I personally find μ more readable but I can still read us, even though I strongly believe it's wrong.

That said, it's clear that a decision on the use of non-ASCII formatting in the standard library should be decided, and while I'm not sure a proper RFC would be appropriate, I feel that there should be some kind of FCP to make sure everyone's points are heard and this discussion is given a formal decision.

Side note (completely off topic), I do find it amusing that someone with a μ for an avatar is arguing against using μ. :p

@WiSaGaN
Contributor

WiSaGaN commented Aug 11, 2020

So is it worth bumping the requirement for properly displaying std Display/Debug output from a terminal that only needs to support ASCII to a Unicode-supporting terminal, over this single case?

On the other hand, if a lot of people feel strongly about it, and we don't have any concrete evidence that this creates significant issues in the wild, keeping it as µs until we have a more convincing argument seems reasonable.

@Dylan-DPC-zz Dylan-DPC-zz added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 15, 2020
@Dylan-DPC-zz

Marking this as waiting-on-team. The relevant discussion can be found here

@BurntSushi
Member

I personally don't have a strong opinion on the clarity of μs vs us. I think the latter is used often enough that its meaning is probably pretty clear.

But I do think it would be unfortunate to declare a ban on all non-ASCII output from all Debug/Display impls in std. It seems like a somewhat extreme action to take. Because of that, I'd like to understand the failure modes better. Could someone post a screenshot or describe in more detail what the failure mode is for a real world example if we write μs and it turns out the consumer can't handle non-ASCII data? I kind of feel like if the answer to that is "it outputs mojibake," then that doesn't seem like such a huge deal to me. Or at least, not worth a blanket ban on non-ASCII output.

@Manishearth
Member

FWIW, I am only proposing we do this for Debug impls, not Display impls. "looking good" is not high on the list of priorities for Debug impls.

I'm also not proposing we do this as a blanket ban; I think that we should avoid non-ASCII in Debug when we can. Obviously a non-ASCII string should Debug as itself (maybe with escapes for unprintables). But there's a viable alternative here, and we should take it.

@BurntSushi
Member

@Manishearth Sure, but there are several folks calling for a more general policy, so I wanted to give my thoughts on that.

I still would like to see more details about failure modes though, if possible.

@joshtriplett
Member Author

For what it's worth, I'm not advocating a blanket ban on the introduction of non-ASCII characters into otherwise-ASCII user data via Debug or Display, for all time. I'm advocating that this particular usage is not worth the real compatibility issues it causes, and that future proposals for std/core to introduce non-ASCII characters when the user's data is entirely ASCII should have a high bar. The policy I'd suggest is "Debug and Display impls in std should avoid introducing non-ASCII characters in otherwise ASCII data, unless there's a critical reason to do so that outweighs potential compatibility issues".

@BurntSushi I've seen multiple failure modes. One includes not showing the character at all, and either leaving a blank, or leaving nothing (so "10µs" would show up as "10s"). Another involves showing a tiny box character, or a substitution character (a box containing hex digits). Another involves showing one (not all) of the UTF-8 character's bytes as if it were the corresponding iso-8859-1 or cp437 (IBM PC) character. Another failure mode involves overwriting the next character after the UTF-8 character, causing display issues.

@BurntSushi
Member

BurntSushi commented Aug 15, 2020

@joshtriplett Thanks for elaborating. Some of those failure modes do sound quite bad, especially when the character is dropped completely. If I were to try to re-create one or more of those failure modes on my machine, how would I go about doing it? (I have a mac and a Windows box for testing that I can use, if necessary.) I'm trying to get a sense of how commonplace it is. Because if we're just going to do this on a case by case basis---which I'm totally fine with---then I think we really need to understand just how badly this is impacting folks.

@clarfonthey
Contributor

clarfonthey commented Apr 30, 2023

I don't think that an EC2 bug is a good enough reason to avoid using Unicode in a language ostensibly designed to use Unicode. I don't think that working around others' poor Unicode implementations is acceptable when the failure mode isn't catastrophic (just garbled output, not a failure to run tests), and especially considering it's from a provider who considers themselves big supporters of Rust.

Amazon should be fixing this bug; we shouldn't be "fixing" it for them and giving them less of an excuse to provide proper Unicode support just because their English-speaking customers "don't need" it.

And like, I empathise with being in a miserable situation due to having no control over what cloud provider your workplace uses, and having to deal with bugs that should be fixed that you know won't be fixed, and being caught in that crossfire. But I don't think the solution is to just change the text here; it's to go back to the source and point it out as an actual bug in the EC2 console and not Rust's duty to fix.

@kanashimia

It looks like this is the only place in stdlib where non-ASCII is used for output

~/rust/library 
›rg -g '!tests' -g '!test_data' -g '!benches' -g '!tests.rs' -e '[^\x00-\x7F]' | rg -v '^.+?:\s*//'
core/src/num/flt2dec/mod.rs:fast (naïve division and modulo). But it is surprisingly hard to print
core/src/num/uint_macros.rs:        #[doc = concat!("//    3  MAX    (a = 3 × 2^", stringify!($BITS), " + 2^", stringify!($BITS), " - 1)")]
core/src/num/uint_macros.rs:        #[doc = concat!("// +  5    7    (b = 5 × 2^", stringify!($BITS), " + 7)")]
core/src/num/uint_macros.rs:        #[doc = concat!("//    9    6    (sum = 9 × 2^", stringify!($BITS), " + 6)")]
core/src/num/uint_macros.rs:        #[doc = concat!("//    9    6    (a = 9 × 2^", stringify!($BITS), " + 6)")]
core/src/num/uint_macros.rs:        #[doc = concat!("// -  5    7    (b = 5 × 2^", stringify!($BITS), " + 7)")]
core/src/num/uint_macros.rs:        #[doc = concat!("//    3  MAX    (diff = 3 × 2^", stringify!($BITS), " + 2^", stringify!($BITS), " - 1)")]
core/src/num/int_macros.rs:        #[doc = concat!("//   10  MAX    (a = 10 × 2^", stringify!($BITS), " + 2^", stringify!($BITS), " - 1)")]
core/src/num/int_macros.rs:        #[doc = concat!("// + -5    9    (b = -5 × 2^", stringify!($BITS), " + 9)")]
core/src/num/int_macros.rs:        #[doc = concat!("//    6    8    (sum = 6 × 2^", stringify!($BITS), " + 8)")]
core/src/num/int_macros.rs:        #[doc = concat!("//    6    8    (a = 6 × 2^", stringify!($BITS), " + 8)")]
core/src/num/int_macros.rs:        #[doc = concat!("// - -5    9    (b = -5 × 2^", stringify!($BITS), " + 9)")]
core/src/num/int_macros.rs:        #[doc = concat!("//   10  MAX    (diff = 10 × 2^", stringify!($BITS), " + 2^", stringify!($BITS), " - 1)")]
core/src/time.rs:                "µs",
alloc/src/str.rs:            if c == 'Σ' {
alloc/src/str.rs:            debug_assert!('Σ'.len_utf8() == 2);
alloc/src/str.rs:            to.push_str(if is_word_final { "ς" } else { "σ" });
alloc/src/collections/mod.rs:            Enable exposing the allocator’s custom error value \

@joshtriplett
Member Author

@clarfonthey Don't assume this is an EC2 bug. It's a bug in one of the many layers between text output and display. It could, for instance, be in the Linux console, or the serial console, or the configuration of one of those.

@joshtriplett
Member Author

@BurntSushi Best guess:

Build a statically linked Rust binary for Linux, that opens up the serial port "/dev/ttyS0" and prints random Duration values in the microsecond range to it in a loop, with a slight delay in between so that the serial output is readable. Set up an AWS instance. Boot it with init=/that/rust/binary (so that no other distro initialization code takes place). Watch the output.
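A sketch of what that reproducer might look like; the device path and the loop details are illustrative, not a tested recipe:

    use std::fs::OpenOptions;
    use std::io::Write;
    use std::thread::sleep;
    use std::time::Duration;

    fn main() -> std::io::Result<()> {
        // Write Duration debug output straight to the first serial port,
        // with no getty or locale setup in between.
        let mut serial = OpenOptions::new().write(true).open("/dev/ttyS0")?;
        let mut seed: u64 = 1;
        loop {
            seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1); // cheap PRNG
            let d = Duration::from_micros(seed % 1_000); // stay in the microsecond range
            writeln!(serial, "{:?}", d)?; // e.g. "123µs", the suffix under test
            sleep(Duration::from_millis(500));
        }
    }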

@joshtriplett
Member Author

joshtriplett commented May 1, 2023

@kanashimia Exactly.

@clarfonthey

good enough reason to avoid using Unicode in a language ostensibly designed to use Unicode

Rust is designed to support Unicode. That doesn't mean it should be designed to force the use of Unicode when the user didn't use it.

Can we uplevel this for a moment? We're evaluating a tradeoff between, as far as I can tell, "this is known to be broken in some circumstances" and (as far as I can tell) "this looks aesthetically prettier". (Previously, we were evaluating a tradeoff between "this is asserted to be broken in some unspecified circumstances" and "this looks aesthetically prettier", but we've moved from an assertion to a specific concrete example.) (See the last paragraph of this comment; I genuinely don't know how to describe this tradeoff better.)

Rust does not, normally, take the strategy of "we use a feature that's mostly supported and we don't care about the people who have systems where it doesn't work". We regularly try to make Rust more robust in the face of OS issues, library issues, environment issues, and all sorts of unusual circumstances. The kinds of workarounds we put in the standard library are much more painful to deal with; on that scale, this is an easy change with zero maintenance effort or overhead.

If the status quo were us and someone was proposing a change to make it µs, there is no way that change would get into the library; the tradeoffs are just not comparable.

It seems like this change is brushing up against some people's values, resulting in pushback that I'm finding hard to understand. I'd like to understand what those values are. Normally, I understand an opposing position well enough to explain it back, to the satisfaction of those who hold it. Here I genuinely don't think I understand the opposing position well enough to do that. How would people who don't like this change describe the thing being traded off against "this produces broken output on some systems" here?

@the8472
Member

the8472 commented May 1, 2023

It seems like this change is brushing up against some people's values [...] Here I genuinely don't think I understand the opposing position well enough to do that.
How would people who don't like this change describe the thing being traded off against "this produces broken output on some systems" here?

  • Avoid papering over other components' bugs in Rust if they can be fixed there.
  • Don't kill the last Unicode character, don't establish a core-is-ASCII-only precedent, don't negotiate with the ASCII past holding hostages.
  • This is harmless enough that one can hardly argue it is urgent or even necessary to fix. The monotonic clock issue was papered over (with a lot of pain) because it crashed user programs.

If the status quo were us and someone was proposing a change to make it µs, there is no way that change would get into the library; the tradeoffs are just not comparable.

I think that says more about review biases (as we're exercising right now) than about whether u vs µ is better. And I could at least see myself arguing that to_string_lossy can already produce � (U+FFFD), so why not add µ when introducing Duration, and this is Debug anyway, and if you want something else write your own Display, etc. etc.

Rust does not, normally, take the strategy of "we use a feature that's mostly supported and we don't care about the people who have systems where it doesn't work"

Unicode is not some platform-specific optional part of Rust though. If a platform doesn't fully support unicode that's a limitation around which user code targeting that platform has to bend, not Rust.

@clarfonthey
Contributor

clarfonthey commented May 1, 2023

It looks like this is the only place in stdlib where non-ASCII is used for output

The code you're linking is specific to Unicode lowercase handling, which is different per-character versus per-string since a capital Σ can become a lowercase σ or ς depending on whether the character is at a word-final position. I wouldn't consider this a case of "using Unicode in the standard library" since we're specifically talking about Unicode output in this thread, whereas that concerns Unicode input.

Can we uplevel this for a moment? We're evaluating a tradeoff between, as far as I can tell, "this is known to be broken in some circumstances" and (as far as I can tell) "this looks aesthetically prettier".

I'm not sure how many people have this viewpoint, but my viewpoint is specifically not for aesthetics, but correctness. "us" is not correct; "μs" is. While it has been common for decades to use the Latin alternative due to shortcomings of encodings and keyboards, this has always been a compromise, and never officially part of the SI standard. You can verify this since the BIPM includes all the relevant documents online: https://www.bipm.org/en/measurement-units/si-prefixes

That said, there is one alternative that I only recently found out about, which is to use the deprecated, non-normalised character U+B5 from Latin-1 (ISO 8859-1) which is only a single byte and also compatible with Unicode. This may be more compatible in some situations, but you'd have to test. The main difficulty here is that any form of case-folding or normalisation will convert it to the proper Greek script character, U+3BC.
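A small demonstration of the case-mapping hazard mentioned above, using std's built-in Unicode case conversions (the mappings shown follow the Unicode character database):

    fn main() {
        let micro_sign = "\u{00B5}"; // µ MICRO SIGN, the Latin-1-compatible code point
        let upper = micro_sign.to_uppercase();
        assert_eq!(upper, "\u{039C}"); // Μ GREEK CAPITAL LETTER MU
        // Lowercasing from there yields the Greek mu, not the original U+00B5:
        assert_eq!(upper.to_lowercase(), "\u{03BC}"); // μ GREEK SMALL LETTER MU
    }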

If the status quo were us and someone was proposing a change to make it µs, there is no way that change would get into the library; the tradeoffs are just not comparable.

I disagree; one Unicode character for another isn't really that big of a tradeoff. Remember, Unicode was first standardised in 1991, meaning folks have had over three decades to implement it. Simply interpreting all input as ASCII or Latin-1 is completely unacceptable in 2023. Maintainers of systems which still do this should be called out for the anglocentric BS they subscribe to, since they've had literal decades to fix this. I can't even call them eurocentric, since this is a Greek letter we're talking about! Latin-1 had this letter!


Now, we could sidestep this entire thing and use the (also correct) micros, although this would probably prompt using millis and nanos too. And this would be less familiar to an audience who uses non-Latin languages, although those users are likely already familiar with Latin script, due to the keywords in Rust and, well, language in general.

And I'd imagine that the aesthetics folks would probably prefer "us" over micros just because it's shorter. Personally, I would grumble a lot, but I would not completely object if people decided it was worth it enough to go forward with "micros" instead. Similarly, I'm less a fan of inflexible format impls and would not object to a method on Duration which returns a prefix enum that could be used for folks to format more easily on their own.

The reason why I'm standing my ground so hard here is because this only "solves" the problem for English speakers using these terminals with poor/nonexistent Unicode support, and the real solution is for these people to fix their software. We've already run into the mess with IPv6, another standard which has existed for literal decades but is still lacking adequate support. GitHub itself still only offers an IPv4 address!

I think that forcing those who can "get away" with bad Unicode support to see just how inconveniencing it is is a valid strategy for encouraging further adoption. So, in my humble IMHO, I don't think that we should change it. But I would accept a replacement that is correct like "micros."

Just please don't go "us" and especially don't go "mcs," which I like to read as "millicentiseconds" just to show how bad it is.

@joshtriplett
Member Author

joshtriplett commented May 1, 2023

The reason why I'm standing my ground so hard here is because this only "solves" the problem for English speakers using these terminals with poor/nonexistent Unicode support, and the real solution is for these people to fix their software.

We're talking about debugging output here. That debugging output is filled with other ASCII text, and this is literally the only character that isn't. I have debugged software in circumstances where ASCII text would have been a luxury and I was lucky to get a pixel in a corner I could flash on and off to indicate "the routine is still alive, still alive...oh, it's dead now". I've debugged software where the only debug output was a seven-segment display on a board, or a couple of LEDs. There's software out there that flashes keyboard LEDs in morse code for debugging. There's software out there that encodes debugging output into the high bits of the real-time clock because it's persistent across reboots and if you decode it before too much time has passed then the information you encoded is still there.

I absolutely agree with you that being able to just assume Unicode everywhere would be ideal. I absolutely agree with you that any software that doesn't understand Unicode should be fixed. And for presentation to an end user, I'm all for using the correct non-ASCII SI prefix. Any display device that's going to display end-user-readable text needs to be able to present that text in their language. But I don't think debugging output should push the limits of available capabilities. Debugging output exists precisely for cases when everything isn't working the way it's supposed to, and as a result, debugging output should use as few capabilities as absolutely necessary to present information to the developer.

If that means using "micros", fine. If that means spelling it Microseconds(123), fine. Anything that reliably prints.

If I see 1.5��s, the only reason I have any idea what that is is because I know about this particular issue and could make an educated guess based on this being the only SI prefix that uses a non-ASCII character. Someone seeing that who doesn't know about this issue may not have that immediately come to mind. (Is it data or memory corruption? Is it a CR/LF getting stuffed into something that doesn't accept those? Is it a control character from some other layer that's getting injected into my output? Is some random binary getting spewed to the terminal in the background?) But if someone sees 1.5us, they're much more likely to know what it means. Even if they grumble "that's not the right spelling for microseconds", they're still going to have a better idea of what it means than ��s.

@the8472
Member

the8472 commented May 1, 2023

That debugging output is filled with other ASCII text, and this is literally the only character that isn't.

That happens to be the case in this narrow scenario. If you don't print durations it doesn't happen. If you print paths it will happen. If you mangle binary data through to_string_lossy it will happen.

It takes several conditions coinciding to hit this scenario, and even then it isn't fatal.

There's software out there that encodes debugging output into the high bits of the real-time clock because it's persistent across reboots and if you decode it before too much time has passed then the information you encoded is still there.

And those are adaptations to the circumstances by the humans doing the debugging. They bent around the broken system and got the information they needed when the default debugging facilities weren't sufficient.

That does not mean that std should gain a "write still alive bit to RTC" debugging facility or anything else catering to those cases.

What you're asking for is not just replacing a single character; it is a "core never uses anything other than ASCII" policy for all eternity, because going forward we will never be able to prove the negative that every single system that can't handle Unicode has disappeared from the face of the earth.
In the future we should be allowed to use more Unicode, not less.

If I see 1.5��s, the only reason I have any idea what that is is because I know about this particular issue and could make an educated guess based on this being the only SI prefix that uses a non-ASCII character. Someone seeing that who doesn't know about this issue may not have that immediately come to mind.

I highly doubt that someone who deals with Linux early-boot debugging and serial consoles suddenly reaches their limit at a few Unicode replacement characters (which are known to arise from charset issues), gives up, and never figures out the problem, despite already getting the significant digits out of the duration.

And even if that's somehow the issue we could mention in the Duration documentation that it'll use unicode for formatting instead of removing it.

@BurntSushi
Member

BurntSushi commented May 1, 2023

@joshtriplett Thanks for expanding on how this might happen. It definitely seems pretty strange and obscure to me? It would be difficult for me to try and reproduce it I think.

It looks like there are a variety of views here. So I just wanted to try to briefly make my own position clear here:

  • I have a preference to keep using μs (U+03BC) even if there are some circumstances where it doesn't render correctly. I do favor the pragmatic thing here, but IMO, "rendering fails in one particular case" doesn't automatically mean (to me) that the pragmatic choice is to drop it. My preference for μs is I think rooted in it being nicer than us, and by nicer, I mean that it uses the actual SI unit label.
  • I do think "the rendering should be fixed" is in part a valid response to this. I'm less a fan of "using this as a way to get people to fix their shit" (what I would call an activist stance). I'm not sure how much daylight there actually is between these two positions, and I'm also not sure if it matters much.
  • If we don't use μs, then I think we should use us. It's hard for me to gauge whether us is less well understood than μs, but my sense of things is that in the context of printing a duration, anyone that understands what μs means will also understand (or be able to make a reasonable inference) as to what us means.
  • I'm strongly opposed to things like Micros(1.23) or 1.23 micros. That's way more verbose IMO and is firmly in "the cure is worse than the disease" territory for me.
  • I do personally sympathize with the "if we limit this Debug output to ASCII-only, then it's going to set a very strong precedent to do the same for all other Debug outputs" idea. And in some sense, this issue is, whether we like it or not, acting as a proxy for the more general one. I do tend to prefer focusing on specific instances and concrete use cases and making judgment calls given appropriate context, but the importance of precedence is undeniable. And a change like this is absolutely going to set a precedent IMO.

To be more concrete here about what I mean by "have a preference for μs", if there were an FCP on this issue, I probably wouldn't vote in favor of it, but I don't think I'd register a blocking concern for it. (Unless there is some other more compelling argument brought up against this that I'm not yet aware of.) (Also, is this a libs-api issue? Or a libs issue? If the latter, then it doesn't really matter what I'd do in an FCP. Haha.)

@joshtriplett
Member Author

joshtriplett commented May 1, 2023

@BurntSushi wrote:

I do think "the rendering should be fixed" is in part a valid response to this. I'm less a fan of "using this as a way to get people to fix their shit" (what I would call an activist stance). I'm not sure how much daylight there actually is between these two positions, and I'm also not sure if it matters much.

I do think there's a lot of daylight between those two stances. I'd agree with the first (the rendering should be fixed), but not the second (we shouldn't be a canary).

"if we limit this Debug output to ASCII-only, then it's going to set a very strong precedent to do the same for all other Debug outputs" idea

With or without this change, I would raise a blocking concern for introducing any new Unicode in a Debug impl in the standard library, and I'd look very carefully at uses of them in Display though I'd be marginally less concerned there. But personally I'd rather treat this as not precedent in either direction, and just decide things on a case-by-case basis. As an example of a case that does seem perfectly fine, for instance, if you're formatting a Unicode string and it already has non-ASCII in it, it doesn't do any harm to have functions that do «/» or 「/」 quoting. Also, something specifically designed to emit Unicode characters (e.g. a "draw this as a tree" function if we wanted that in the standard library) would be fine, since the user specifically asked for that.

Also, is this a libs-api issue? Or a libs issue? If the latter, then it doesn't really matter what I'd do in an FCP. Haha.

I could see it going either way. Certainly a general statement about the future would probably be libs-api. One specific case like this is arguably not API, since the text of Debug output isn't API, but on the other hand, arguably the way many people here are treating it suggests there's more attached to it than an implementation detail?

@joshtriplett
Member Author

If you print paths it will happen.

If you print paths, you're going to get characters supplied by the user. If the user fed in Unicode paths, they'll get Unicode output. If they fed in non-Unicode, they may get non-Unicode output (if the program handles paths correctly) or Unicode replacement characters in their output (if it doesn't). But you're not going to get data that didn't come from the user or the user's filesystem.

What makes the mu in impl Debug for Duration notable is that it's introducing Unicode where the user didn't supply any. That is the issue I have with it.

@the8472
Member

the8472 commented May 1, 2023

I don't understand the argument. Users generally don't control every aspect of their entire filesystems. So as long as a filesystem supports Unicode but the console doesn't, you have a source of Unicode bytes in your program that can make it to the display and end up garbled.

Where garbled output comes from doesn't matter. You'll have to use your brain and deal with slightly garbled output in some way. Since the possibility exists I don't see why we need to eliminate it in this particular corner.

File a bug report with AWS or the Linux kernel, ignore it, adjust some configuration, or change your debug printing (escape it!), recompile the program, redeploy. Those are all options that don't involve changing Rust. Why would changing Rust be the first resort here?

What makes the mu in impl Debug for Duration notable is that it's introducing Unicode where the user didn't supply any.

It's not "introducing unicode". It always prints unicode. Whenever you use println!("{:?}", ...) you've already elected to use unicode strings. It's right there in the type system. Sometimes they just happen to coincide with ascii but the guarantee that it would only ever be that was never made. You're using the wrong API if you want that guarantee. stdout.write_all(), b"", the ascii crate etc. are there if you need that.

@joshtriplett
Member Author

joshtriplett commented May 1, 2023

Users generally don't control every aspect of their entire filesystems.

Users are, generally, the sources of non-ASCII filenames, either directly (by creating them) or indirectly (by naming something else that makes its way into a filename, or by obtaining files from some other source). And in any case, you have to have brought those files and filenames into the system somehow. Rust isn't the originator of those non-ASCII characters. Rust is the originator of the non-ASCII character in Duration.

So no, in a program that isn't printing user-controlled potentially arbitrary data, this can be the only source of non-ASCII. This is the one case where we're generating non-ASCII output without getting non-ASCII input.

Since the possibility exists I don't see why we need to eliminate it in this particular corner.

Because then the possibility won't exist anymore.

@the8472
Member

the8472 commented May 1, 2023

@BurntSushi Best guess:

Build a statically linked Rust binary for Linux, that opens up the serial port "/dev/ttyS0" and prints random Duration values in the microsecond range to it in a loop, with a slight delay in between so that the serial output is readable. Set up an AWS instance. Boot it with init=/that/rust/binary (so that no other distro initialization code takes place). Watch the output.

Well, systemd does some setup for virtual consoles at least. There probably is similar code for serial ones. So if the rust program replaces the init process it should take on its responsibilities too.

https://github.com/systemd/systemd/blob/5c519a56f5caf73c5d418563af939974b0af43be/src/vconsole/vconsole-setup.c#L231-L256

@albertlarsan68
Member

Just bringing my 2 cents: on AZERTY keyboards (used in France), the µ letter is accessible with just a press of Shift and the appropriate key (the same one as *).

@scottmcm
Member

scottmcm commented May 1, 2023

I'm not convinced that even the new example is "completely broken", since "��s" is distinct from any other thing that duration will show. So I think it's a meaningfully different failure mode from, say, it showing as time.idle=94.2s that would be silently wrong. This appears to be a very clear and recognizable "hey, you got your encoding wrong".

We're evaluating a tradeoff between [...] and "this looks prettier".

Note that μs is unambiguously the correct way. It's not that it's prettier, it's that it's the standard-compliant way to write the prefix (https://www.bipm.org/en/publications/si-brochure/ Tableau 7).

I don't think some systems disagreeing between UTF-8 or ISO/IEC 8859-1 or whatever is a good enough reason to not follow the standard when Unicode support is baked so deeply into Rust. (If this were C++, where nobody ever really knows what a char* means, then being 7-bit-clean would be more reasonable.)

@clarfonthey
Contributor

clarfonthey commented May 1, 2023

While we're discussing options, I figure I might as well confuse matters even worse by adding another suggestion.

This is maybe controversial but I think that actually formatting the units for the Debug version of Duration is more than I would personally expect for a Debug implementation. To me, something like write!(f, "{}.{:09}s", self.as_secs(), self.subsec_nanos()) would make more sense. Perhaps the existing implementation could be relegated to an alt-flag version.

My logic here is that when debugging durations, you are probably most interested in the exact value, and "helpfully" performing the extra division for the user would require them to re-multiply the units if they were concerned with the actual value as a seconds + nanoseconds pair. Of course, this isn't the only interpretation users would like, which is why I think keeping the existing version would also be helpful. (Perhaps being able to return a Displayable version of the duration might be useful as an extra method, since it would allow users to format it themselves without resorting to as_secs_f64 or doing the maths themselves.)

I also lean more toward making Debug implementations "look" more like Rust syntax, so if I were designing a Debug implementation for a library I were writing, I would probably make it look like a call to Duration::new that could be copy-pasted. Of course, I realise that this is just my opinion and not that of the standard library (which, to my knowledge, does not have this precedent), and so I wouldn't expect libstd to follow this trend.
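A sketch of that seconds-plus-nanos proposal on a hypothetical wrapper type, to make the output concrete (this is not the actual std impl):

    use std::fmt;
    use std::time::Duration;

    // Hypothetical wrapper showing the proposed "{secs}.{nanos:09}s" form.
    struct PlainSeconds(Duration);

    impl fmt::Debug for PlainSeconds {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "{}.{:09}s", self.0.as_secs(), self.0.subsec_nanos())
        }
    }

    fn main() {
        println!("{:?}", PlainSeconds(Duration::from_micros(1500))); // 0.001500000s
    }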

@clarfonthey
Contributor

clarfonthey commented May 1, 2023

I do think there's a lot of daylight between those two stances. I'd agree with the first (the rendering should be fixed), but not the second (we shouldn't be a canary).

Even though I have a strong feeling that everyone commenting thus far is aware, I figure it's worth pointing out for everyone reading that the "you" and "we" here in this argument is not specifically referring to anyone speaking here (I have dealt with annoyances like this and I don't wish them on anyone), but instead to the person who is statistically making 3-4x the median wage wherever they live, comfortably programming for a living, who will probably be annoyed by the error, but whose life will ultimately remain unaffected.

As folks have said, it's very likely folks will be able to determine what the actual string means, even though it'll be mangled, but it could potentially be enough visibility to get the underlying issue (poor unicode support) fixed. I think that the actual argument is on how to balance the annoyance of the downstream bug(s), the ease of fixing the issue, the correctness of the units, and the likelihood of this change having any effect on whether downstream bugs are actually fixed.

My take is, resummarised: it didn't take me long when I was learning French in primary school to realise that Ã© in the middle of a word was just é being displayed incorrectly, and I wasn't really scratching my head thinking for that long. I now know that it's specifically UTF-8 being interpreted as Latin-1, but that detail isn't super relevant for those needing to work around it, only for those needing to actually fix it. I think we can assume that someone running into this bug will be able to figure it out, considering the circumstances, and thus we should remain on the side of correctness over compromise.

@joshtriplett
Member Author

@the8472 wrote:

Well, systemd does some setup for virtual consoles at least. There probably is similar code for serial ones. So if the rust program replaces the init process it should take on its responsibilities too.

I'm running under systemd here; I was trying to give a simple example that's likely to reproduce the problem. In any case, systemd isn't responsible for the serial port in this context, as there's no getty/console running on it, just debugging output.

I'm sure it's possible to do additional work debugging which of the many layers here is failing to handle UTF-8; that's really not the issue here.

@joshtriplett
Member Author

@clarfonthey wrote:

This is maybe controversial but I think that actually formatting the units for the Debug version of Duration is more than I would personally expect for a Debug implementation. To me, something like write!(f, "{}.{:09}s", self.as_secs(), self.subsec_nanos()) would make more sense. Perhaps the existing implementation could be relegated to an alt-flag version.

That sounds entirely reasonable. (I'd skip having the alt-flag version, FWIW.) It's simpler, produces correct output in all circumstances, and if someone wants fancier output they can use a time formatting crate (which could then provide whatever level of flexibility they wanted in how to display units).

@the8472
Member

the8472 commented May 3, 2023

I'm sure it's possible to do additional work debugging which of the many layers here is failing to handle UTF-8; that's really not the issue here.

Then why not do that instead of rushing to change std? That looks a bit like an "if you have a hammer, everything looks like a nail" issue.

@joshtriplett
Member Author

@the8472 If we use a system call that has issues on a range of Linux kernels or Windows versions, we don't take the approach of saying "go fix the OS" and otherwise ignoring the problem. We do tell people to go fix the OS, but we also fix Rust to stop triggering the bug.

@the8472
Member

the8472 commented May 8, 2023

We rarely have issues that are only cosmetic. If the syscall bug only had cosmetic consequences on some older systems then I'd also argue in favor of not fixing it.

You outright say you didn't investigate the root cause. Which means the root cause might be a simple configuration problem, which really is not Rust's responsibility.

@workingjubilee
Member

regarding μs:

The National (yes, as in the United States) Institute of Standards and Technology:

6.1.8 Unacceptability of abbreviations for units

Because acceptable units generally have internationally recognized symbols and names, it is not permissible to use abbreviations for their unit symbols or names, such as sec (for either s or second), sq. mm (for either mm² or square millimeter), cc (for either cm³ or cubic centimeter), mins (for either min or minutes), hrs (for either h or hours), lit (for either L or liter), amps (for either A or amperes), AMU (for either u or unified atomic mass unit), or mps (for either m/s or meter per second). Although the values of quantities are normally expressed using symbols for numbers and symbols for units (see Sec. 7.6), if for some reason the name of a unit is more appropriate than the unit symbol (see Sec. 7.6, note 3), the name of the unit should be spelled out in full.

Emphasis mine.

I believe the "unified atomic mass unit" is also known as the "Dalton", but I can't say whether Da or u is indeed more common.

This means the official view of the NIST is "Thou shalt not print 'microseconds' as anything but microsecond or μs. Not 'micros' or 'mcs', and 'us' is right out!" This is the view of the same organization that has anti-metrication propaganda on their website.

That said, another rendering that is equally precise in meaning seems okay.

@clarfonthey
Contributor

clarfonthey commented Oct 6, 2023

I know that this issue is super old, but since it's still not closed, I figured I'd add another argument to the pile for why this isn't a good idea: screen readers. Screen readers will read "us" as, well, "us", whereas "µs" is read as "microseconds". Yes, it's a very small detail and screen reader users are used to things being pronounced incorrectly, but this came up today and I figured it'd be worth adding to the pile.

@tbu-
Contributor

tbu- commented Dec 13, 2023

Since this has been open for three years without a decision, I think it should be closed. There is clearly no consensus for merging it.

@the8472
Member

the8472 commented Jan 19, 2024

The assigned reviewer has been off the rotation since 2020.

I'm going to close this based on the lack of consensus (thus invoking the status-quo-bias) and my personal preference. If someone feels strongly enough about the issue then I suggest filing a bug. Though it'll likely require some strong arguments to shift opinions on this.

Labels
A-Unicode Area: Unicode S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.