Short time format is missing space under en_US on Fedora 38+ #83571

tmds · 2023-03-17T06:11:06Z

The short time format on Fedora 38 has replaced its regular (breaking) space with a 'NARROW NO-BREAK SPACE' (U+202F).

That space gets removed by:

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/CultureData.Icu.cs

Line 225 in 34c0472

    
           private static string ConvertIcuTimeFormatString(ReadOnlySpan<char> icuFormatString)

Any character not explicitly recognized by this function (like U+202F) gets removed.

This function includes a specific case for a regular 'NO-BREAK SPACE' (U+00A0):

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/CultureData.Icu.cs

Lines 261 to 264 in 34c0472

    
           case '\u00A0':
               // Convert nonbreaking spaces into regular spaces
               result[resultPos++] = ' ';
               break;

There are two ways to fix this:

Instead of removing unknown characters, we pass characters (including these non-breaking spaces) as is.
Or, we add U+202F so it also gets converted to a regular space.
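A minimal sketch of how the two options differ, assuming a heavily simplified converter. `ConvertSketch` and its two parameters are hypothetical names for illustration, not the runtime's implementation, and `h:mm\u202Fa` is only an illustrative ICU pattern:

```csharp
using System;
using System.Text;

// Hypothetical, simplified stand-in for ConvertIcuTimeFormatString.
// It only knows a handful of pattern characters and ignores quoting.
static string ConvertSketch(string icuFormat, bool passThroughUnknown = false, bool mapNarrowNoBreakSpace = false)
{
    var result = new StringBuilder(icuFormat.Length + 1);
    foreach (char c in icuFormat)
    {
        switch (c)
        {
            case ':': case '.': case 'H': case 'h': case 'm': case 's': case ' ':
                result.Append(c);          // same meaning in ICU and .NET
                break;
            case '\u00A0':
                result.Append(' ');        // existing behavior: NO-BREAK SPACE becomes a space
                break;
            case 'a':
                result.Append("tt");       // ICU AM/PM marker becomes the .NET designator
                break;
            case '\u202F' when mapNarrowNoBreakSpace:
                result.Append(' ');        // second option: also map U+202F to a space
                break;
            default:
                if (passThroughUnknown)
                    result.Append(c);      // first option: keep unknown characters as is
                break;                     // otherwise the character is dropped
        }
    }
    return result.ToString();
}

// Current behavior in this sketch: the narrow no-break space is dropped.
Console.WriteLine(ConvertSketch("h:mm\u202Fa"));   // h:mmtt

// First option: the U+202F survives in the converted format string.
Console.WriteLine(ConvertSketch("h:mm\u202Fa", passThroughUnknown: true).Contains('\u202F'));   // True

// Second option: the U+202F is replaced by a regular space.
Console.WriteLine(ConvertSketch("h:mm\u202Fa", mapNarrowNoBreakSpace: true));   // h:mm tt
```

The two flags are independent only to keep the sketch short; a real change would pick one behavior.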

I don't know why the current implementation opted for the second option for U+00A0.
I think it may be to have the same time format as on Windows under en_US.

I have a slight preference for the first option, because the second is overwriting part of the format information from icu. Also, the first option would have prevented this issue from occurring.

What is the preferred option?

cc @omajid

ghost · 2023-03-17T06:11:13Z

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	tmds
Assignees:	-
Labels:	`area-System.Globalization`
Milestone:	-

tmds · 2023-03-17T06:11:37Z

cc @stephentoub

stephentoub · 2023-03-17T11:01:44Z

> I don't know why the current implementation opted for the second option for U+00A0.
> I think it may be to have the same time format as on Windows under en_US.

@ellismg, do you remember?

> Instead of removing unknown characters, we pass characters (including these non-breaking spaces) as is.

What breaks if we do that?

tmds · 2023-03-17T11:53:58Z

> What breaks if we do that?

It affects formatting: non-breaking spaces don't break to the next line.

It breaks tests that expect the space to be used on Linux in the en-US time format, like on Windows.

I was wondering about something.

using System.Text;

foreach (string s in new string[] { "'\u202F'", "'\u00A0'" })
{
    Console.WriteLine(s);
    Console.WriteLine(AsciiRoundTrip(s));
}

static string AsciiRoundTrip(string s) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

This prints:

' '
'?'
' '
'?'

It seems neither of these non-breaking spaces gets converted to a regular space when converting to ASCII.
If that's the intended behavior for the ASCII conversion(?), it seems desirable not to include non-breaking spaces, for ASCII compatibility.

tmds · 2023-03-17T13:40:47Z

Based on the above, I assume that, for ASCII conversion purposes, we want regular spaces in the time format.

Clockwork-Muse · 2023-03-17T15:04:26Z

> If that's the intended behavior for the ASCII conversion(?), it seems desired to not include non-breaking spaces for ASCII compatibility.

Yeah, but that's ASCII. Format strings can have any number of non-ASCII characters (eg, the year/month/day symbols in Japanese), so doing it at the ICU ingest step seems wrong.

tmds · 2023-03-17T15:35:00Z

> so doing it at the ICU ingest step seems wrong.

I agree.
I would also prefer we use the format strings as they come from icu.
That is the correct choice.

The non-breaking space case was already there.
The rationale isn't mentioned in the code. My guess is that it's there to use a space that exists in ASCII.
Handling another type of non-breaking space the same way is the safe choice, as it minimizes the risk of breaking anyone.

> Format strings can have any number of non-ASCII characters

Any such characters are currently filtered out in ConvertIcuTimeFormatString.

stephentoub · 2023-03-17T15:38:00Z

> I would also prefer we use the format strings as they come from icu. That is the correct choice.

You're suggesting we have two completely different parsing / formatting code paths based on whether the format string came from a user or from ICU? What about places where these format strings are returned to the user and need to be the .NET format? I don't see how this is a viable or correct choice. Have I totally misunderstood?

tmds · 2023-03-17T15:40:39Z

This function takes the format string that comes from icu, and translates it into the equivalent .NET format string.

stephentoub · 2023-03-17T15:42:09Z

Right. It sounded like you were saying you didn't want that done at all. I probably misunderstood... what did you mean by "I would also prefer we use the format strings as they come from icu. That is the correct choice."?

tmds · 2023-03-17T15:43:50Z

The function (ConvertIcuTimeFormatString) is currently replacing the non-breaking spaces with regular spaces, and skipping unknown characters. It could pass them through.

stephentoub · 2023-03-17T15:45:18Z

> The function (ConvertIcuTimeFormatString) is currently replacing the non-breaking spaces with regular spaces, and skipping unknown characters. It could pass them through.

Right, I understand.

Is that the extent of what you meant? That for anything not understood it should just be passed through, rather than not translating anything? I understood your comment to mean you wished zero translation was being done.

tmds · 2023-03-17T15:48:35Z

> Is that the extent of what you meant?

Yes.

And I agree with @Clockwork-Muse that this is the correct thing.

For the same reason the non-breaking space case exists already, we may want to add an additional case (mapping 'U+202F' to ' '), and avoid making a backwards incompatible change.

tarekgh · 2023-03-17T16:05:06Z

> we may want to add an additional case (mapping 'U+202F' to ' '),

This is wrong. CLDR intentionally added U+202F to prevent the formatted text from wrapping in the middle. We shouldn't change the space. IIRC we already read 'U+202F' in the number formats too.

The change here is not only about formatting; we also need to ensure that parsing handles this character.

tmds · 2023-03-17T16:11:35Z

@tarekgh should we also remove the existing case for U+00A0 then?

tarekgh · 2023-03-17T16:16:25Z

CLDR is already changing the usage of U+00A0 to U+202F, so it is not going to matter whether we change it or not. The important thing here is that we want to ensure the parsing is tolerant of all these space variations.

tarekgh · 2023-03-17T16:19:06Z

> Yeah, but that's ASCII. Format strings can have any number of non-ASCII characters (eg, the year/month/day symbols in Japanese),

The case here is the time format, which does not include any date formats.

tmds · 2023-03-17T16:21:04Z

With my question I actually meant to ask: it was not necessary to map U+00A0 to a regular space?

tarekgh · 2023-03-17T16:43:54Z

> it was not necessary to map U+00A0 to a regular space?

It was helpful to avoid the parsing problem, but it was not correct to convert it to a regular space.

I see you have opened a PR, which I think needs some more work. Are you willing to help get the correct fix?

Just for reference, ICU had a similar issue too. https://unicode-org.atlassian.net/browse/ICU-20067

tarekgh · 2023-03-17T16:47:58Z

Here we handle these spaces in the number parsing

runtime/src/libraries/System.Private.CoreLib/src/System/Number.Parsing.cs

Line 2665 in 6ef9d10

    
           private static bool IsSpaceReplacingChar(char c) => c == '\u00a0' || c == '\u202f';
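As a rough illustration of what space-tolerant parsing could look like, here is a minimal sketch. `IsSpaceReplacingChar` is copied from the line above; `MatchesIgnoringSpaceVariant` is a hypothetical helper that does not exist in the runtime and only shows the idea of treating U+00A0 and U+202F like a regular space when comparing input against an expected pattern:

```csharp
using System;

// Copied from Number.Parsing.cs as quoted above.
static bool IsSpaceReplacingChar(char c) => c == '\u00a0' || c == '\u202f';

// Hypothetical helper: compares two strings character by character,
// treating U+00A0 and U+202F as equivalent to a regular space.
static bool MatchesIgnoringSpaceVariant(string input, string expected)
{
    if (input.Length != expected.Length)
        return false;

    for (int i = 0; i < input.Length; i++)
    {
        char a = IsSpaceReplacingChar(input[i]) ? ' ' : input[i];
        char b = IsSpaceReplacingChar(expected[i]) ? ' ' : expected[i];
        if (a != b)
            return false;
    }
    return true;
}

// User input typed with a regular space still matches text formatted with U+202F.
Console.WriteLine(MatchesIgnoringSpaceVariant("9:41 AM", "9:41\u202FAM"));   // True
Console.WriteLine(MatchesIgnoringSpaceVariant("9:41AM", "9:41\u202FAM"));    // False
```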

Clockwork-Muse · 2023-03-17T17:23:08Z

> The case here is time format and not having any date formats.

Okay, fine, you can also have non-ASCII hour/minute/second characters in Japanese too.

tarekgh · 2023-03-17T17:45:14Z

I don't think ICU has such formats, otherwise we would see it.

Clockwork-Muse · 2023-03-17T19:16:47Z

.... you appear to be right, they don't seem to be doing localized time formats (other than AM/PM markers, which would be substituted in the format string, so not the same thing). I wonder why, when they have localized date formats.

tmds · 2023-03-20T10:20:13Z

> I see you have opened a PR which I think need some more work. Are you willing to help get the correct fix?

Yes, I will update the PR this week based on this discussion.

dotnet-issue-labeler bot added the area-System.Globalization label Mar 17, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Mar 17, 2023

tmds removed the untriaged New issue has not been triaged by the area owner label Mar 17, 2023

tmds mentioned this issue Mar 17, 2023

UnitTests.Semantics.BinaryOperators test failure on Fedora rawhide dotnet/roslyn#67101

Open

tmds mentioned this issue Mar 17, 2023

ConvertIcuTimeFormatString: convert narrow no-break spaces to spaces too. #83589

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 17, 2023

tarekgh added this to the 8.0.0 milestone Mar 17, 2023

tarekgh closed this as completed in #83589 Apr 4, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 4, 2023

tmds mentioned this issue Apr 18, 2023

Improve DateTime{Offset} formatting further in a variety of cases #84963

Merged

ghost locked as resolved and limited conversation to collaborators May 5, 2023
