Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short time format is missing space under en_US on Fedora 38+ #83571

Closed
tmds opened this issue Mar 17, 2023 · 24 comments · Fixed by #83589
Closed

Short time format is missing space under en_US on Fedora 38+ #83571

tmds opened this issue Mar 17, 2023 · 24 comments · Fixed by #83589

Comments

@tmds
Copy link
Member

tmds commented Mar 17, 2023

The short time format on Fedora 38 has replaced its breaking space by a 'NARROW NO-BREAK SPACE' (U+202F).

That space gets removed by:

private static string ConvertIcuTimeFormatString(ReadOnlySpan<char> icuFormatString)

Any character not explicitly recognized by this function (like U+202F) gets removed.

This function includes a specific case for a regular 'NO-BREAK SPACE' (U+00A0):

case '\u00A0':
// Convert nonbreaking spaces into regular spaces
result[resultPos++] = ' ';
break;

There are two ways to fix this:

  • Instead of removing unknown characters, we pass characters (including these non-breaking spaces) as is.
  • Or, we add U+202F so it also gets converted to a regular space.

I don't know why the current implementation opted for the second option for U+00A0.
I think it may be to have the same time format as on Windows under en_US.

I have a slight preference for the first option, because the second is overwriting part of the format information from icu. And, the second option would have prevented this issue from occurring.

What is the preferred option?

cc @omajid

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Mar 17, 2023
@ghost
Copy link

ghost commented Mar 17, 2023

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

The short time format on Fedora 38 has replaced its breaking space by a 'NARROW NO-BREAK SPACE' (U+202F).

That space gets removed by:

private static string ConvertIcuTimeFormatString(ReadOnlySpan<char> icuFormatString)

Any character not explicitly recognized by this function (like U+202F) gets removed.

This function includes a specific case for a regular 'NO-BREAK SPACE' (U+00A0):

case '\u00A0':
// Convert nonbreaking spaces into regular spaces
result[resultPos++] = ' ';
break;

There are two ways to fix this:

  • Instead of removing unknown characters, we pass characters (including these non-breaking spaces) as is.
  • Or, we add U+202F so it also gets converted to a regular space.

I don't know why the current implementation opted for the second option for U+00A0.
I have a slight preference for the first option, because the second is overwriting part of the format information from icu.

What is the preferred option?

cc @omajid

Author: tmds
Assignees: -
Labels:

area-System.Globalization

Milestone: -

@tmds tmds removed the untriaged New issue has not been triaged by the area owner label Mar 17, 2023
@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

cc @stephentoub

@stephentoub
Copy link
Member

I don't know why the current implementation opted for the second option for U+00A0.
I think it may be to have the same time format as on Windows under en_US.

@ellismg, do you remember?

Instead of removing unknown characters, we pass characters (including these non-breaking spaces) as is.

What breaks if we do that?

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

What breaks if we do that?

It affects formatting: non-breaking spaces don't break to the next line.

It breaks tests that expect the space to be used on Linux in the en-US time format, like on Windows.

I was wondering about something.

using System.Text;

foreach (string s in new string[] { "'\u202F'", "'\u00A0'" })
{
    Console.WriteLine(s);
    Console.WriteLine(AsciiRoundTrip(s));
}

static string AsciiRoundTrip(string s) =>
    Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

This prints:

' '
'?'
' '
'?'

It seems neither of these non-breaking spaces get converted to regular spaces when converting to ASCII.
If that's the intended behavior for the ASCII conversion(?), it seems desired to not include non-breaking spaces for ASCII compatibility.

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

Based on the above, I assume that, for ASCII conversion purposes, we want regular spaces in the time format.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 17, 2023
@Clockwork-Muse
Copy link
Contributor

If that's the intended behavior for the ASCII conversion(?), it seems desired to not include non-breaking spaces for ASCII compatibility.

Yeah, but that's ASCII. Format strings can have any number of non-ASCII characters (eg, the year/month/day symbols in Japanese), so doing it at the ICU ingest step seems wrong.

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

so doing it at the ICU ingest step seems wrong.

I agree.
I would also prefer we use the format strings as they come from icu.
That is the correct choice.

The non-breaking space case was already there.
The rationale isn't mentioned in the code. It's my guess it's for using a space that exists in ASCII.
Adding another type of non-breaking space is the safe choice as it minimizes breaking anyone.

Format strings can have any number of non-ASCII characters

Any such characters are currently filtered out in ConvertIcuTimeFormatString.

@stephentoub
Copy link
Member

I would also prefer we use the format strings as they come from icu. That is the correct choice.

You're suggesting we have two completely different parsing / formatting code paths based on whether the format string came from a user or from ICU? What about places where these format strings are returned to the user and need to be the .NET format? I don't see how this is a viable or correct choice. Have I totally misunderstood?

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

This function takes the format string that comes from icu, and translate it in to the equivalent .NET format string.

@stephentoub
Copy link
Member

Right. It sounded like you were saying you didn't want that done at all. I probably misunderstood... what did you mean by "I would also prefer we use the format strings as they come from icu. That is the correct choice."

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

The function (ConvertIcuTimeFormatString) is currently replacing the non-breaking spaces with regular spaces, and skipping unknown characters. It could pass them through.

@stephentoub
Copy link
Member

The function (ConvertIcuTimeFormatString) is currently replacing the non-breaking spaces with regular spaces, and skipping unknown characters. It could pass them through.

Right, I understand.

Is that the extent of what you meant? That for anything not understood it should just be passed through, rather than not translating anything? I understood your comment to mean you wished zero translation was being done.

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

Is that the extent of what you meant?

Yes.

And I agree with @Clockwork-Muse that this is the correct thing.

For the same reason the non-breaking space case exist already, we may want to add an additional case (mapping 'U+202F' to ' '), and avoid making a backwards incompatible change.

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

we may want to add an additional case (mapping 'U+202F' to ' '),

This is wrong. CLDR intentionally added 0+202f to prevent the formatted text from wrapping in the middle. We shouldn't change the space. IIRC we already read 'U+202F' in the number formats too.

The change here is not only regarding the format, but we also want to ensure parsing should handle that too.

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

@tarekgh we should also remove the existing case for U+00A0 then?

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

CLDR is already changing the usage of U+00A0 to U+202F so this is not going to be important whether we change it or not. The important thing here is we want to ensure the parsing is tolerant for all these space variations.

@tarekgh tarekgh added this to the 8.0.0 milestone Mar 17, 2023
@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

Yeah, but that's ASCII. Format strings can have any number of non-ASCII characters (eg, the year/month/day symbols in Japanese),

The case here is time format and not having any date formats.

@tmds
Copy link
Member Author

tmds commented Mar 17, 2023

With my question I actually meant to ask: it was not necessary to map U+00A0 to a regular space?

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

it was not necessary to map U+00A0 to a regular space?

It was helpful to avoid the parsing problem, but this was not correct to convert it to regular space.

I see you have opened a PR which I think need some more work. Are you willing to help get the correct fix?

Just for reference, ICU had a similar issue too. https://unicode-org.atlassian.net/browse/ICU-20067

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

Here we handle these spaces in the number parsing

private static bool IsSpaceReplacingChar(char c) => c == '\u00a0' || c == '\u202f';

@Clockwork-Muse
Copy link
Contributor

The case here is time format and not having any date formats.

Okay, fine, you can also have non-ASCII hour/minute/second characters in Japanese too.

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2023

I don't think ICU has such formats, otherwise we would see it.

@Clockwork-Muse
Copy link
Contributor

.... you appear to be right, they don't seem to be doing localized time formats (other than AM/PM markers, which would be substituted in the format string, so not the same thing). I wonder why, when they have localized date formats.

@tmds
Copy link
Member Author

tmds commented Mar 20, 2023

I see you have opened a PR which I think need some more work. Are you willing to help get the correct fix?

Yes, I will update the PR this week based on this discussion.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 4, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants