Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add numerical ordering option for string comparison operations #109861

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

PranavSenthilnathan
Copy link
Member

@PranavSenthilnathan PranavSenthilnathan commented Nov 15, 2024

Adds numerical ordering for comparison operations (e.g. Compare, Equals, GetHashCode, GetSortKey). This now enables comparisons of numbers based on their numerical value instead of lexicographical order, such as 2 < 10. We don't support Index operations (e.g. StartWith, EndsWith, IsPrefix, IsSuffix) since the underlying globalization libraries, NLS and ICU, don't support it.

Because this new option relies on underlying globalization libraries, there could be differences in behavior for different platforms and libraries:

  • Full ICU (default in most cases):
    • Numbers with leading zeros are treated as equal to equivalent numbers with no leading zeros ("01" == "1").
    • Equivalent numbers in different languages/scripts are treated the same ("1" == "١", where ١ is the Arabic-Indic Digit One).
  • NLS (legacy Windows):
    • Numbers with leading zeros are treated as unequal to equivalent numbers with no leading zeros ("01" != "1"). However, these numbers are compared as expected with unequal numbers, namely 1 < 02 < 2 < 03.
    • Equivalent numbers in different languages/scripts are treated as unequal ("1" != "١")
  • Hybrid ICU:
    • WASM: The same as ICU, but the option set is limited. This PR does not change the valid option set besides allowing adding the NumericOrdering option to any previously valid option. See this for the allowed options.
    • iOS, tvos, maccatalyst: Not supported because I don't have an Apple device to test with. However, Apple does support numeric comparisons for strings.

Implements #13979.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@PranavSenthilnathan PranavSenthilnathan changed the title [WIP] Numerical ordering for string compare/equals/hashcode Add numerical ordering option for string comparison operations Nov 18, 2024
@PranavSenthilnathan PranavSenthilnathan marked this pull request as ready for review November 18, 2024 22:14
/// Indicates that the string comparison must sort sequences of digits (Unicode general category "Nd") based on their numeric value.
/// For example, "2" comes before "10". Non-digit characters such as decimal points, minus or plus signs, etc.
/// are not considered as part of the sequence and will terminate it.
/// </summary>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

///

will be good to add a remarks to this one giving more information like this option can be used in comparisons but not for search (IndexOf/StartsWith/EndsWith). Will be good to hint the behavior difference too when ICU is used against NLS. And last tell this option cannot be combined with the ordinal operations,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was getting long so I just included just the part about indexing. Mentioning search might be confusing because people might consider IndexOf (which isn't supported) as search and GetHashCode (which is supported) as not. I think it's easier to just say that NumericOrdering works in all cases except for indexing.

I prefer to keep the combination behavior with Ordinal and OrdinalIgnoreCase on those members since this is really a property of theirs instead of numeric ordering. This is also consistent with the other options as well which don't mention Ordinal even though they can't combine with them either.

I think the ICU and NLS differences should probably go in docs rather than in doc comments since NLS usage is not going to be high. There are already docs about NLS/ICU differences that we can append to (https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu#behavioral-differences).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if you prefer that, it will be better to edit the docs of the indexing APIs and add the remark there. I am trying to make it easy for the API users to understand when this new enum value is not allowed. I guess users can be puzzled if they get exceptions and do not understand what is wrong.

You don't have to block the PR on that, but it will be good to open a doc issue/PR to add the info as needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. I just checked and we already list the valid options so I will update those docs once this goes in:
Compare
IndexOf

@@ -9186,6 +9186,7 @@ public enum CompareOptions
IgnoreSymbols = 4,
IgnoreKanaType = 8,
IgnoreWidth = 16,
NumericOrdering = 32,
OrdinalIgnoreCase = 268435456,
StringSort = 536870912,
Ordinal = 1073741824,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we change this to use hex numbers for the sake of the readability?

Copy link
Member Author

@PranavSenthilnathan PranavSenthilnathan Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the GenAPI tool generates it and the guidance I've heard is to make as few diffs from that tool as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but still can make diffs :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ref assembly isn’t really designed for readability, it’s designed to be correct and automatically generated

Deviations and manual diffs just cause later downstream pain and hinder the ability to rerun the tool, as people have to fight against it

The better option would be to submit a bug or better a patch such that the output produced by the tool for flags enabled enums is “better” such as using hex or logical shifts to represent the bits instead (1 << 0, 1 << 1, 1 << 2, etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created an issue: dotnet/sdk#44999. GenAPI seems to be in a kind of limbo state where there is a new Roslyn based version we want to switch to (tracked by dotnet/sdk#31088) and we don't want to maintain the current CCI-based one (https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.GenAPI/README.md).

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minor comments, LGTM otherwise. Thanks @PranavSenthilnathan!

I'll let @matouskozak and @ilonatommy comment on the hybrid and WASM stuff.

@matouskozak
Copy link
Member

/azp run runtime-ioslike,runtime-ioslikesimulator,runtime-maccatalyst

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Copy link
Member

@matouskozak matouskozak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for considering the Apple mobile globalization :). Could you please create an issue for the missing support on iOS/..., linking the original issue and the findings that you mentioned in the PR description. I'll look into it later.

I think we might have to temporarily disable the newly added tests for Apple mobile otherwise we will start getting PlatformNotSupportedException for the numeric ordering tests. I started the Apple mobile CI test runs to see if that's the case.

One more thing, in the PR description, there is an example for NLS

However, these numbers are compared as expected with unequal numbers, namely 1 < 02 < 2 < 03.

Does that mean that numbers with leading zeros are always smaller than without. Also, 002 < 02 then?

@PranavSenthilnathan
Copy link
Member Author

/azp run runtime-ioslike,runtime-ioslikesimulator,runtime-maccatalyst

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@PranavSenthilnathan
Copy link
Member Author

Thank you for considering the Apple mobile globalization :). Could you please create an issue for the missing support on iOS/..., linking the original issue and the findings that you mentioned in the PR description. I'll look into it later.

Created #109999

I think we might have to temporarily disable the newly added tests for Apple mobile otherwise we will start getting PlatformNotSupportedException for the numeric ordering tests. I started the Apple mobile CI test runs to see if that's the case.

Updated the PR to skip them. I introduced a new PlatformDetection property IsNumericComparisonSupported instead of reusing the IsHybridGlobalizationOnApplePlatform so it's easier to find and remove once when you get the test working.

One more thing, in the PR description, there is an example for NLS

However, these numbers are compared as expected with unequal numbers, namely 1 < 02 < 2 < 03.

Does that mean that numbers with leading zeros are always smaller than without. Also, 002 < 02 then?

Yes, if the numbers are actually equal, then in NLS the one with more leading zeros is considered less (it's basically a tiebreaker). This behavior is better for deterministic sorting of lists, but the downside is that hash tables won't consider these equal. I prefer ICU's behavior here (JS/wasm does the same) but I don't think there's much we can do about it.

@PranavSenthilnathan
Copy link
Member Author

/azp run runtime-ioslike,runtime-ioslikesimulator,runtime-maccatalyst

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Copy link
Member

@matouskozak matouskozak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the test fixes. I checked the apple mobile CI and everything failing looks to be unrelated and tracked at #103472. Looking good from Apple mobile side.

@ilonatommy do you want to check the WASM changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants