Short time format is missing space under en_US on Fedora 38+ #83571
Comments
Tagging subscribers to this area: @dotnet/area-system-globalization
|
cc @stephentoub |
@ellismg, do you remember?
What breaks if we do that? |
It affects formatting: non-breaking spaces don't break to the next line. It breaks tests that expect the space to be used on Linux in the
I was wondering about something.
using System.Text;
// Round-trip each quoted non-breaking space through ASCII and print both forms.
foreach (string s in new string[] { "'\u202F'", "'\u00A0'" })
{
    Console.WriteLine(s);
    Console.WriteLine(AsciiRoundTrip(s));
}
// Encodes to ASCII bytes and decodes again; non-ASCII characters are replaced.
static string AsciiRoundTrip(string s) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
This prints:
It seems neither of these non-breaking spaces gets converted to a regular space when converting to ASCII. |
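To make the observation above concrete, here is a minimal sketch that prints code points instead of the raw characters. It assumes the default behavior of `Encoding.ASCII`, which substitutes `?` (U+003F) for anything outside ASCII, and it is only an illustration, not something taken from the runtime.

```csharp
using System;
using System.Text;

// Encodes to ASCII bytes and decodes again, as in the snippet above.
static string AsciiRoundTrip(string s) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

foreach (char c in new[] { '\u202F', '\u00A0' })
{
    string roundTripped = AsciiRoundTrip(c.ToString());
    // Show the original code point, the code point after the round trip,
    // and whether .NET classifies the original as white space.
    Console.WriteLine(
        $"U+{(int)c:X4} -> U+{(int)roundTripped[0]:X4}, IsWhiteSpace: {char.IsWhiteSpace(c)}");
}
```

Both characters come back as U+003F rather than U+0020, even though `char.IsWhiteSpace` reports them as white space.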
Based on the above, I assume that, for |
Yeah, but that's ASCII. Format strings can have any number of non-ASCII characters (eg, the year/month/day symbols in Japanese), so doing it at the ICU ingest step seems wrong. |
I agree. The non-breaking space case was already there.
Any such characters are currently filtered out in |
You're suggesting we have two completely different parsing / formatting code paths based on whether the format string came from a user or from ICU? What about places where these format strings are returned to the user and need to be the .NET format? I don't see how this is a viable or correct choice. Have I totally misunderstood? |
This function takes the format string that comes from icu, and translates it into the equivalent .NET format string. |
Right. It sounded like you were saying you didn't want that done at all. I probably misunderstood... what did you mean by "I would also prefer we use the format strings as they come from icu. That is the correct choice." |
The function ( |
Right, I understand. Is that the extent of what you meant? That for anything not understood it should just be passed through, rather than not translating anything? I understood your comment to mean you wished zero translation was being done. |
Yes. And I agree with @Clockwork-Muse that this is the correct thing. For the same reason the non-breaking space case exists already, we may want to add an additional case (mapping |
This is wrong. CLDR intentionally added U+202F to prevent the formatted text from wrapping in the middle. We shouldn't change the space. IIRC we already read 'U+202F' in the number formats too. The change here is not only about the format; we also want to ensure parsing handles it too. |
@tarekgh we should also remove the existing case for |
CLDR is already changing the usage of |
The case here is time format and not having any date formats. |
With my question I actually meant to ask: it was not necessary to map |
It was helpful to avoid the parsing problem, but it was not correct to convert it to a regular space. I see you have opened a PR which I think needs some more work. Are you willing to help get the correct fix? Just for reference, ICU had a similar issue too. https://unicode-org.atlassian.net/browse/ICU-20067 |
Here we handle these spaces in the number parsing
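As a rough illustration of what "handling these spaces when parsing" could mean, the sketch below treats U+0020, U+00A0 and U+202F as interchangeable when comparing input against expected literal text. `IsSpaceEquivalent` and `MatchesIgnoringSpaceKind` are hypothetical names made up for this example; this is not the runtime's number parsing code.

```csharp
using System;

// Hypothetical helper: the space variants discussed in this thread.
static bool IsSpaceEquivalent(char c) =>
    c == ' ' || c == '\u00A0' || c == '\u202F';

// Hypothetical lenient comparison: two strings match when they are equal
// character by character, except that any space variant matches any other.
static bool MatchesIgnoringSpaceKind(string input, string expected)
{
    if (input.Length != expected.Length) return false;
    for (int i = 0; i < input.Length; i++)
    {
        bool same = input[i] == expected[i]
            || (IsSpaceEquivalent(input[i]) && IsSpaceEquivalent(expected[i]));
        if (!same) return false;
    }
    return true;
}

// A user typing a regular space still matches text formatted with U+202F.
Console.WriteLine(MatchesIgnoringSpaceKind("1:30 PM", "1:30\u202FPM")); // True
// Dropping the space entirely is still a mismatch.
Console.WriteLine(MatchesIgnoringSpaceKind("1:30PM", "1:30\u202FPM"));  // False
```

With a comparison like this, the format string can keep the narrow no-break space that CLDR chose while input typed with a regular space would still be accepted.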
|
Okay, fine, you can also have non-ASCII hour/minute/second characters in Japanese too. |
I don't think ICU has such formats, otherwise we would see it. |
.... you appear to be right, they don't seem to be doing localized time formats (other than AM/PM markers, which would be substituted in the format string, so not the same thing). I wonder why, when they have localized date formats. |
Yes, I will update the PR this week based on this discussion. |
The short time format on Fedora 38 has replaced its breaking space by a 'NARROW NO-BREAK SPACE' (U+202F).
That space gets removed by:
runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/CultureData.Icu.cs
Line 225 in 34c0472
Any character not explicitly recognized by this function (like U+202F) gets removed.
This function includes a specific case for a regular 'NO-BREAK SPACE' (U+00A0):
runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/CultureData.Icu.cs
Lines 261 to 264 in 34c0472
There are two ways to fix this:
U+202F so it also gets converted to a regular space.
I don't know why the current implementation opted for the second option for U+00A0.
I think it may be to have the same time format as on Windows under en_US.
I have a slight preference for the first option, because the second is overwriting part of the format information from icu. And, the second option would have prevented this issue from occurring.
What is the preferred option?
cc @omajid
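For readers without the source open, the behavior described above can be sketched roughly as follows. `ConvertTimePattern` and its `passThroughUnknown` flag are hypothetical and heavily simplified (no quoted literals, only a few pattern letters); this is not the code in CultureData.Icu.cs. The pattern `h:mm\u202Fa` is assumed here to be the ICU short time pattern for en_US on Fedora 38.

```csharp
using System;
using System.Text;

// Hypothetical, simplified stand-in for the ICU-to-.NET time pattern conversion.
static string ConvertTimePattern(string icuPattern, bool passThroughUnknown)
{
    var result = new StringBuilder(icuPattern.Length);
    foreach (char c in icuPattern)
    {
        switch (c)
        {
            case ':': case '.': case 'H': case 'h': case 'm': case 's': case ' ':
                result.Append(c);      // recognized characters are copied as-is
                break;
            case '\u00A0':
                result.Append(' ');    // NO-BREAK SPACE becomes a regular space
                break;
            case 'a':
                result.Append("tt");   // ICU AM/PM marker becomes the .NET designator
                break;
            default:
                // Dropping everything else is what loses U+202F;
                // passing it through keeps what ICU provided.
                if (passThroughUnknown) result.Append(c);
                break;
        }
    }
    return result.ToString();
}

string fedora38 = "h:mm\u202Fa";
Console.WriteLine(ConvertTimePattern(fedora38, false)); // h:mmtt (space lost)
Console.WriteLine(ConvertTimePattern(fedora38, true));  // keeps U+202F between mm and tt
```

Adding a `case '\u202F'` next to the `'\u00A0'` case would instead produce `h:mm tt` with a regular space, which is the trade-off discussed in the comments.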