Format Duration microseconds with "us" suffix, without Unicode #75065
Conversation
Formatting a Duration currently uses the suffix "µs" for microseconds. However, this does not take into account whether the terminal or other output device can handle Unicode. Make it more robust on legacy terminals by using "us" instead.
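For reference, a minimal sketch of the behavior under discussion (assuming a recent stable toolchain; the exact text of the Debug output is generally treated as an implementation detail):

```rust
use std::time::Duration;

fn main() {
    // Sub-millisecond durations currently Debug-format with the Unicode
    // micro sign, e.g. "10µs" rather than "10us".
    println!("{:?}", Duration::from_micros(10));
}
```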
(rust_highfive has picked a reviewer for you, use r? to override) |
Context: when developing CLI applications for widespread use, I try to avoid assuming the user's terminal is Unicode-capable without checking first. Checking locale for something as simple as a duration suffix doesn't seem worthwhile, when there's a widely understood ASCII suffix that works as an alternative. As far as I know, no other Display or Debug instance in the Rust standard library includes non-ASCII characters; I think this is the only one. |
I'm torn here. What you say isn't unreasonable, but I'm also always opposed to using SI incorrectly, and it's unambiguously the case that "µs" is correct under SI and "us" isn't. I think I might be more persuaded for |
What about making the standard output use the proper character and requiring the alternative form to show a My opinion is that because formatting is explicitly Unicode, the proper Unicode character should be used by default. But calling it an alternative form is fine. |
Debug renderings may still end up in log files, crash dumps, and other places where "be conservative in what you send" applies. |
IMHO, it's 2020 and programs should have Unicode support. At minimum, the ability to pass along Unicode characters without breaking. |
I agree with @scottmcm and @clarfon and would prefer not merging this. In addition to what they said, we make hardly any promises about the
I don't think that's a viable solution, especially for |
@clarfon
No argument there. But many people in 2020 run software that wasn't designed or written in 2020. There's no excuse in 2020 to write software that doesn't handle Unicode. But there's absolutely a reason in 2020 to write software that doesn't assume Unicode: https://en.wikipedia.org/wiki/Robustness_principle If you're writing software to run on myriad end-user systems, some of whom may be running a random xterm or rxvt or serial connection, or for that matter a Linux virtual-terminal which does not have full Unicode character support (partly for font reasons), it's reasonable to limit yourself to ASCII by default. This is as far as I know the only case in the Rust standard library that emits non-ASCII characters, and it's a cosmetic usage in a The alternative here is not "well, guess I'll just force my users to make Unicode work everywhere or deal with breakage". The alternative here is "well, guess I have to hand-write my own Duration formatting, or more likely wrap it in something that replaces the suffix". @LukasKalbertodt, on Zulip you said "I guess |
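A sketch of the "wrap it in something that replaces the suffix" approach mentioned above; the `AsciiDuration` name and approach are illustrative, not an existing API:

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical ASCII-only wrapper: reuse std's Debug rendering, then swap
/// the micro sign for a plain "u".
struct AsciiDuration(Duration);

impl fmt::Debug for AsciiDuration {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = format!("{:?}", self.0).replace('µ', "u");
        f.write_str(&s)
    }
}

fn main() {
    println!("{:?}", AsciiDuration(Duration::from_micros(10))); // "10us"
}
```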
If people really see it as valuable to emit the I don't think that level of checking is worthwhile for emitting |
Again, if you're running in a place that doesn't support Unicode then I can't imagine it breaking catastrophically. And this sets a precedent for Rust in general to prefer ASCII despite being Unicode by default and I don't think this is reasonable. What's to say you wouldn't be forcing someone else to write the same kinds of workarounds you're suggesting to convert the u back into a mu? What if we want to add other Unicode symbols in debug output for brevity? Should we decline doing so because of ASCII-only terminals? What about strings that contain printable Unicode characters that aren't ASCII? |
I've definitely seen Unix tools break on Unicode because of misconfigured locales. I use an inordinate amount of unicode on the command line and I've experienced issues many times. Especially when there are multiple terminals involved (e.g. running through screen). This is the Debug output: its main purpose is to go to the console and work, not look nice. I wouldn't be surprised if logging tools broke with misconfigured locales here. |
In
We should pass through the user's Unicode data. The standard library just shouldn't add gratuitous Unicode to a user's ASCII string. |
How would it break in this particular case though? Both µ encodes as |
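The byte-level difference is easy to check; a small sketch (the micro sign U+00B5 and the Greek letter U+03BC are distinct code points with distinct encodings):

```rust
fn main() {
    // U+00B5 MICRO SIGN: a single byte (0xB5) in Latin-1, two bytes in UTF-8.
    println!("{:02X?}", "µ".as_bytes()); // [C2, B5]
    // U+03BC GREEK SMALL LETTER MU is a different code point entirely.
    println!("{:02X?}", "μ".as_bytes()); // [CE, BC]
}
```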
Feel like there should be some sort of formal decision made on this, rather than just one PR. That way these kinds of discussions only happen once. |
I simply thought about it a bit more. However, my opinion on this is not particularly strong either way.
Well, that argument is not universally true, right? The output is still intended to be human-readable and improving readability is useful IMO. There are quite a few manual
I agree. Any idea how to best do this? Create an issue and ping all relevant teams? (Probably just libs?) Or just ping them here? |
Things aren't readable if they get mojibake'd or crash the terminal 😄. |
I personally feel strongly about the use of μ over u mostly because the symbol is genuinely different, and it's inconsequential whether they're lexemic to English-speakers. I personally find μ more readable but I can still read us, even though I strongly believe it's wrong. That said, it's clear that a decision on the use of non-ASCII formatting in the standard library should be made, and while I'm not sure a proper RFC would be appropriate, I feel that there should be some kind of FCP to make sure everyone's points are heard and this discussion is given a formal decision. Side note (completely off topic), I do find it amusing that someone with a μ for an avatar is arguing against using μ. :p |
So is it worth it to bump the requirement of properly displaying std On the other hand, if there are a lot of people feeling strongly about it, and we don't have any concrete evidence that this creates significant issues in the wild, keeping it as µs until we have a more convincing argument seems reasonable. |
Marking this as waiting-on-team. The relevant discussion can be found here |
I personally don't have a strong opinion on the clarity of But I do think it would be unfortunate to declare a ban on all non-ASCII output from all Debug/Display impls in std. It seems like a somewhat extreme action to take. Because of that, I'd like to understand the failure modes better. Could someone post a screenshot or describe in more detail what the failure mode is for a real world example if we write |
FWIW, I am only proposing we do this for Debug impls, not Display impls. "Looking good" is not high on the list of priorities for Debug impls. I'm also not proposing we do this as a blanket ban; I think that we should avoid non-ASCII in Debug when we can. Obviously a non-ASCII string should Debug as itself (maybe with escapes for unprintables). But there's a viable alternative here, so we should take it. |
@Manishearth Sure, but there are several folks calling for a more general policy, so I wanted to give my thoughts on that. I still would like to see more details about failure modes though, if possible. |
For what it's worth, I'm not advocating a blanket ban on the introduction of non-ASCII characters into otherwise-ASCII user data via Debug or Display, for all time. I'm advocating that this particular usage is not worth the real compatibility issues it causes, and that future proposals for std/core to introduce non-ASCII characters when the user's data is entirely ASCII should have a high bar. The policy I'd suggest is "Debug and Display impls in std should avoid introducing non-ASCII characters in otherwise ASCII data, unless there's a critical reason to do so that outweighs potential compatibility issues". @BurntSushi I've seen multiple failure modes. One includes not showing the character at all, and either leaving a blank, or leaving nothing (so "10µs" would show up as "10s"). Another involves showing a tiny box character, or a substitution character (a box containing hex digits). Another involves showing one (not all) of the UTF-8 character's bytes as if it were the corresponding iso-8859-1 or cp437 (IBM PC) character. Another failure mode involves overwriting the next character after the UTF-8 character, causing display issues. |
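One of those failure modes is easy to model in code: if a display layer treats the UTF-8 bytes as Latin-1, each byte is rendered as its own character. A minimal sketch of that case only (it does not reproduce the dropped-character or cp437 variants):

```rust
fn main() {
    // "10µs" encoded as UTF-8.
    let bytes = "10µs".as_bytes(); // 31 30 C2 B5 73

    // Interpreting each byte as a Latin-1 character renders 0xC2 as 'Â' and
    // 0xB5 as 'µ', i.e. the classic "Âµ" mojibake.
    let as_latin1: String = bytes.iter().map(|&b| b as char).collect();
    println!("{}", as_latin1); // prints "10Âµs"
}
```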
@joshtriplett Thanks for elaborating. Some of those failure modes do sound especially bad, especially when the character is dropped completely. If I were to try to re-create one or more of those failure modes on my machine, how would I go about doing it? (I have a mac and a Windows box for testing that I can use, if necessary.) I'm trying to get a sense of how commonplace it is. Because if we're just going to do this on a case by case basis---which I'm totally fine with---then I think we really need to understand just how badly this is impacting folks. |
I don't think that an EC2 bug is a good enough reason to avoid using Unicode in a language ostensibly designed to use Unicode. I don't think that working around others' poor Unicode implementations is acceptable when the failure mode isn't catastrophic (just garbled output, not a failure to run tests), and especially considering it's from a provider who considers themselves big supporters of Rust. Amazon should be fixing this bug; we shouldn't be "fixing" it for them and giving them less of an excuse to provide proper Unicode support just because their English-speaking customers "don't need" it. And like, I empathise with being in a miserable situation due to having no control over what cloud provider your workplace uses, and having to deal with bugs that should be fixed that you know won't be fixed, and being caught in that crossfire. But I don't think the solution is to just change the text here; it's to go back to the source and point it out as an actual bug in the EC2 console and not Rust's duty to fix. |
It looks like this is the only place in stdlib where non-ASCII is used for output. |
@clarfonthey Don't assume this is an EC2 bug. It's a bug in one of the many layers between text output and display. It could, for instance, be in the Linux console, or the serial console, or the configuration of one of those. |
@BurntSushi Best guess: Build a statically linked Rust binary for Linux, that opens up the serial port "/dev/ttyS0" and prints random Duration values in the microsecond range to it in a loop, with a slight delay in between so that the serial output is readable. Set up an AWS instance. Boot it with |
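A rough sketch of that reproduction loop, assuming /dev/ttyS0 exists and is writable; whether it actually triggers garbled output depends on the console setup at the other end:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    let mut serial = OpenOptions::new().write(true).open("/dev/ttyS0")?;
    loop {
        // Pseudo-random microsecond value, kept dependency-free for the sketch.
        let micros = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .subsec_micros() as u64;
        writeln!(serial, "{:?}", Duration::from_micros(micros))?;
        sleep(Duration::from_millis(500));
    }
}
```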
@kanashimia Exactly.
Rust is designed to support Unicode. That doesn't mean it should be designed to force the use of Unicode when the user didn't use it. Can we uplevel this for a moment? We're evaluating a tradeoff between, as far as I can tell, "this is known to be broken in some circumstances" and (as far as I can tell) "this looks aesthetically prettier". (Previously, we were evaluating a tradeoff between "this is asserted to be broken in some unspecified circumstances" and "this looks aesthetically prettier", but we've moved from an assertion to a specific concrete example.) (See the last paragraph of this comment; I genuinely don't know how to describe this tradeoff better.) Rust does not, normally, take the strategy of "we use a feature that's mostly supported and we don't care about the people who have systems where it doesn't work". We regularly try to make Rust more robust in the face of OS issues, library issues, environment issues, and all sorts of unusual circumstances. The kinds of workarounds we put in the standard library are much more painful to deal with; on that scale, this is an easy change with zero maintenance effort or overhead. If the status quo were It seems like this change is brushing up against some people's values, and resulting in some pushes against it that I'm finding hard to understand. I'd like to understand what those values are. Normally, I understand an opposing position well enough to explain it back, to the satisfaction of those who hold it. Here I genuinely don't think I understand the opposing position well enough to do that. How would people who don't like this change describe the thing being traded off against "this produces broken output on some systems" here? |
I think that says more about review biases (as we're doing right now here) than about whether u vs µ is better. And I could at least see myself arguing that
Unicode is not some platform-specific optional part of Rust though. If a platform doesn't fully support unicode that's a limitation around which user code targeting that platform has to bend, not Rust. |
The code you're linking is specific to Unicode lowercase handling, which is different per-character versus per-string since a capital Σ can become a lowercase σ or ς depending on whether the character is at a word-final position. I wouldn't consider this a case of "using Unicode in the standard library" since we're specifically talking about Unicode output in this thread, whereas that concerns Unicode input.
I'm not sure how many people have this viewpoint, but my viewpoint is specifically not for aesthetics, but correctness. "us" is not correct; "μs" is. While it has been common for decades to use the Latin alternative due to shortcomings of encodings and keyboards, this has always been a compromise, and never officially part of the SI standard. You can verify this since the BIPM includes all the relevant documents online: https://www.bipm.org/en/measurement-units/si-prefixes That said, there is one alternative that I only recently found out about, which is to use the deprecated, non-normalised character U+B5 from Latin-1 (ISO 8859-1) which is only a single byte and also compatible with Unicode. This may be more compatible in some situations, but you'd have to test. The main difficulty here is that any form of case-folding or normalisation will convert it to the proper Greek script character, U+3BC.
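The case-folding point is straightforward to check with Rust's own case mappings (which follow Unicode's tables):

```rust
fn main() {
    let micro_sign = '\u{00B5}'; // µ, MICRO SIGN (also 0xB5 in Latin-1)
    let greek_mu = '\u{03BC}';   // μ, GREEK SMALL LETTER MU

    // U+00B5 uppercases to U+039C (GREEK CAPITAL LETTER MU), which lowercases
    // to U+03BC; the round trip does not come back to the micro sign.
    let upper: String = micro_sign.to_uppercase().collect();
    let lower: String = upper.chars().flat_map(char::to_lowercase).collect();
    assert_eq!(lower, greek_mu.to_string());
}
```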
I disagree; one Unicode character for another isn't really that big of a tradeoff. Remember, Unicode was first standardised in 1991, meaning it's been over two decades that folks have had time to implement it. Simply interpreting all input as ASCII or Latin-1 is completely unacceptable in 2023. Maintainers of systems which still do this should be called out for the anglocentric BS they subscribe to, since they've had literal decades to fix this. I can't even call them eurocentric since this is a Greek letter we're talking about! Latin-1 had this letter! Now, we could sidestep this entire thing and use the (also correct) micros, although this would probably prompt using millis and nanos too. And, this would be less familiar to an audience who uses non-Latin languages, although that audience is likely already familiar with Latin script due to the keywords in Rust and, just, language in general. And I'd imagine that the aesthetics folks would probably prefer "us" over micros just because it's shorter. Personally, I would grumble a lot, but I would not completely object if people decided it was worth it enough to go forward with "micros" instead. Similarly, I'm less a fan of inflexible format impls and would not object to a method on The reason why I'm standing my ground so hard here is because this only "solves" the problem for English speakers using these terminals with poor/nonexistent Unicode support, and the real solution is for these people to fix their software. We've already run into the mess with IPv6, another standard which has existed for literal decades but is still lacking adequate support. GitHub itself still only offers an IPv4 address! I think that forcing those who can "get away" with bad Unicode support to see just how inconvenient it is is a valid strategy for encouraging further adoption. So, in my humble IMHO, I don't think that we should change it. But I would accept a replacement that is correct like "micros." Just please don't go "us" and especially don't go "mcs," which I like to read as "millicentiseconds" just to show how bad it is. |
We're talking about debugging output here. That debugging output is filled with other ASCII text, and this is literally the only character that isn't. I have debugged software in circumstances where ASCII text would have been a luxury and I was lucky to get a pixel in a corner I could flash on and off to indicate "the routine is still alive, still alive...oh, it's dead now". I've debugged software where the only debug output was a seven-segment display on a board, or a couple of LEDs. There's software out there that flashes keyboard LEDs in morse code for debugging. There's software out there that encodes debugging output into the high bits of the real-time clock because it's persistent across reboots and if you decode it before too much time has passed then the information you encoded is still there. I absolutely agree with you that being able to just assume Unicode everywhere would be ideal. I absolutely agree with you that any software that doesn't understand Unicode should be fixed. And for presentation to an end user, I'm all for using the correct non-ASCII SI prefix. Any display device that's going to display end-user-readable text needs to be able to present that text in their language. But I don't think debugging output should push the limits of available capabilities. Debugging output exists precisely for cases when everything isn't working the way it's supposed to, and as a result, debugging output should use as few capabilities as absolutely necessary to present information to the developer. If that means using "micros", fine. If that means spelling it If I see |
That happens to be the case in this narrow scenario. If you don't print durations, that doesn't happen. If you print paths, it will happen. If you mangle any binary data through It takes a conjunction of several factors to hit this scenario, and even then it isn't fatal.
And those are adaptions to the circumstances by the humans doing debugging. They bent around the broken system and got the information they needed when default debugging facilities weren't sufficient. That does not mean that std should gain a "write still alive bit to RTC" debugging facility or anything else catering to those cases. What you're asking for is not just replacing a single character, it is a "core never uses anything other than ASCII" policy for all eternity because going forward we will never be able to prove a negative that every single system that can't handle unicode has disappeared from the face of earth.
I highly doubt that someone who deals with Linux early-boot debugging and serial consoles suddenly reaches their limit at a few Unicode replacement characters (which are known to arise from charset issues) and then gives up and never figures out the problem, despite already getting the significant digits out of the duration. And even if that's somehow the issue, we could mention in the |
@joshtriplett Thanks for expanding on how this might happen. It definitely seems pretty strange and obscure to me? It would be difficult for me to try and reproduce it I think. It looks like there are a variety of views here. So I just wanted to try to briefly make my own position clear here:
To be more concrete here about what I mean by "have a preference for |
@BurntSushi wrote:
I do think there's a lot of daylight between those two stances. I'd agree with the first (the rendering should be fixed), but not the second (we shouldn't be a canary).
With or without this change, I would raise a blocking concern for introducing any new Unicode in a
I could see it going either way. Certainly a general statement about the future would probably be libs-api. One specific case like this is arguably not API, since the text of Debug output isn't API, but on the other hand, arguably the way many people here are treating it suggests there's more attached to it than an implementation detail? |
If you print paths, you're going to get characters supplied by the user. If the user fed in Unicode paths, they'll get Unicode output. If they fed in non-Unicode, they may get non-Unicode output (if the program handles paths correctly) or Unicode replacement characters in their output (if it doesn't). But you're not going to get data that didn't come from the user or the user's filesystem. What makes the mu in |
I don't understand the argument. Users generally don't control every aspect of their entire filesystems. So as long as a filesystem supports unicode but the console doesn't then you have a source of unicode bytes in your program that can make it to the display and end up garbled. Where garbled output comes from doesn't matter. You'll have to use your brain and deal with slightly garbled output in some way. Since the possibility exists I don't see why we need to eliminate it in this particular corner. File a bug report with AWS, the linux kernel, ignore it, adjust some configuration or change your debug printing (escape it!), recompile the program, redeploy. Those are all options that don't involve changing Rust. Why would changing Rust be the first resort here?
It's not "introducing unicode". It always prints unicode. Whenever you use |
Users are, generally, the sources of non-ASCII filenames, either directly (by creating them) or indirectly (by naming something else that makes its way into a filename, or by obtaining files from some other source). And in any case, you have to have brought those files and filenames into the system somehow. Rust isn't the originator of those non-ASCII characters. Rust is the originator of the non-ASCII character in So no, in a program that isn't printing user-controlled potentially arbitrary data, this can be the only source of non-ASCII. This is the one case where we're generating non-ASCII output without getting non-ASCII input.
Because then the possibility won't exist anymore. |
Well, systemd does some setup for virtual consoles at least. There probably is similar code for serial ones. So if the rust program replaces the init process it should take on its responsibilities too. |
Just bringing my two cents: on AZERTY keyboards (from France), the µ letter is accessible just by a press of Shift and the appropriate key (the same one containing |
I'm not convinced that even the new example is "completely broken", since "��s" is distinct from any other thing that duration will show. So I think it's a meaningfully different failure mode from, say, it showing as
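One plausible path to a replacement character, for concreteness: if some layer re-encodes the output as Latin-1 and a later consumer decodes it as UTF-8, the lone 0xB5 byte is invalid UTF-8 and becomes U+FFFD. (This is only one possible mangling; the doubled "��" above suggests a different one.) A minimal sketch:

```rust
fn main() {
    // "10µs" re-encoded as Latin-1: the micro sign becomes the single byte 0xB5.
    let latin1_bytes = [0x31u8, 0x30, 0xB5, 0x73];
    // Decoding those bytes as UTF-8 replaces the invalid byte with U+FFFD.
    let decoded = String::from_utf8_lossy(&latin1_bytes);
    assert_eq!(decoded, "10\u{FFFD}s"); // displays as "10�s"
}
```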
Note that I don't think some systems disagreeing between UTF-8 or ISO/IEC 8859-1 or whatever is a good enough reason to not follow the standard when Unicode support is baked so deeply into Rust. (If this was C++ where nobody ever really knows that a |
While we're discussing options, I figure I might as well confuse matters even worse by adding another suggestion. This is maybe controversial but I think that actually formatting the units for the My logic here is that when debugging durations, you probably are most interested in its exact value, and "helpfully" performing the extra division for the user would require them to re-multiply the units in case they were concerned with the actual value as a seconds + nanoseconds pair. Of course, this isn't the only interpretation that users would like, which is why I think having the existing version would also be helpful as well. (Perhaps being able to return a I also lean more toward making |
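As a sketch of what that "exact value" rendering could look like, under the assumption that it exposes the seconds + nanoseconds pair directly (the `ExactDebug` wrapper is hypothetical, not an existing or proposed std API):

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical wrapper illustrating an "exact value" rendering: show the
/// seconds + nanoseconds pair instead of scaling to a unit suffix.
struct ExactDebug(Duration);

impl fmt::Debug for ExactDebug {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}s + {}ns", self.0.as_secs(), self.0.subsec_nanos())
    }
}

fn main() {
    println!("{:?}", ExactDebug(Duration::new(1, 500))); // "1s + 500ns"
}
```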
Even though I have a strong feeling that everyone commenting thus far is aware, I figure it's worth pointing out for everyone reading that the "you" and "we" here in this argument is not specifically referring to anyone speaking here (I have dealt with annoyances like this and I don't explicitly wish it on anyone) but instead referring to "you" and "we" as in the person who is statistically making 3-4x the median wage wherever they live comfortably programming for a living, who will probably be annoyed by the error, but whose life will ultimately remain unaffected. As folks have said, it's very likely folks will be able to determine what the actual string means, even though it'll be mangled, but it could potentially be enough visibility to get the underlying issue (poor unicode support) fixed. I think that the actual argument is on how to balance the annoyance of the downstream bug(s), the ease of fixing the issue, the correctness of the units, and the likelihood of this change having any effect on whether downstream bugs are actually fixed. My take is, resummarised: it didn't take me long when I was learning French in primary school to realise that é in the middle of a word was just é being displayed incorrectly, and I wasn't really scratching my head thinking for that long. I now know that it's specifically UTF-8 being interpreted as Latin-1, but that detail isn't super relevant for those needing to work around it, only for those needing to actually fix it. I think we can assume that someone running into this bug will be able to figure it out, considering the circumstances, and thus we should remain on the side of correctness over compromise. |
@the8472 wrote:
I'm running under systemd here; I was trying to give a simple example that's likely to reproduce the problem. In any case, systemd isn't responsible for the serial port in this context, as there's no getty/console running on it, just debugging output. I'm sure it's possible to do additional work debugging which of the many layers here is failing to handle UTF-8; that's really not the issue here. |
@clarfonthey wrote:
That sounds entirely reasonable. (I'd skip having the alt-flag version, FWIW.) It's simpler, produces correct output in all circumstances, and if someone wants fancier output they can use a time formatting crate (which could then provide whatever level of flexibility they wanted in how to display units). |
Then why not do that instead of rushing to change std? That looks a bit "if you have a hammer everything looks like a nail" issue. |
@the8472 If we use a system call that has issues on a range of Linux kernels or Windows versions, we don't take the approach of saying "go fix the OS" and otherwise ignoring the problem. We do tell people to go fix the OS, but we also fix Rust to stop triggering the bug. |
We rarely have issues that are only cosmetic. If the syscall bug only had cosmetic consequences on some older systems then I'd also argue in favor of not fixing it. You outright say you didn't investigate the root cause. Which means the root cause might be a simple configuration problem, which really is not Rust's responsibility. |
Regarding μs: The National (yes, as in the United States) Institute of Standards and Technology:
Emphasis mine. I believe the "unified atomic mass unit" is also known as the "Dalton", but I can't say whether This means the official view of the NIST is "Thou shalt not print 'microseconds' as anything but microsecond or μs. Not 'micros' or 'mcs', and 'us' is right out!" This is the view of the same organization that has anti-metrication propaganda on their website. That said, another rendering that is equally precise in meaning seems okay. |
I know that this issue is super old, but since it's still not closed, I figured I'd add another argument to the pile for why this isn't a good idea: screen readers. Screen readers will read "us" as, well, "us," whereas "µs" is microseconds. Yes, it's a very small detail and screen reader users are used to things being pronounced incorrectly, but this came up today and I figured it'd be worth adding to the pile. |
Since this has been open for three years without a decision, I think it should be closed. There is clearly no consensus for merging it. |
The assigned reviewer has been off the rotation since 2020. I'm going to close this based on the lack of consensus (thus invoking the status-quo-bias) and my personal preference. If someone feels strongly enough about the issue then I suggest filing a bug. Though it'll likely require some strong arguments to shift opinions on this. |