Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser] HybridGlobalization correct HashCode ranges of skipped unicodes #97351

Closed

Conversation

ilonatommy
Copy link
Member

@ilonatommy ilonatommy commented Jan 22, 2024

Background:

In #96354 we introduced a mechanism of calculating HashCodes for invariant culture and non-invariant culture with CompareOptions.None and CompareOptions.IgnoreCase. In order to make the invariant HashCode function be in line with JS-equal function: localeCompare we are skipping some unicodes. The ranges used in the original PR were collected using ConsoleApp on Windows which turned out not to be a correct approach - they were NLS-based ranges.
For browser (v8-based browsers list is same as Firefox) the list of skipped unicodes is shorter (1826 instead of ~16k).

Reason for this PR:

The bigger range does not include the whole corrected range.

Skipped codes by UnicodeCategory:

Control: 65 (out of 1105)
SpaceSeparator: 17 (out of 289)
OtherPunctuation: 628 (out of 7004)
OpenPunctuation: 79 (out of 1343)
ClosePunctuation: 77 (out of 1309)
DashPunctuation: 26 (out of 425)
ConnectorPunctuation: 10 (out of 170)
InitialQuotePunctuation: 12 (out of 204)
Format: 170 (out of 731)
FinalQuotePunctuation: 10 (out of 170)
NonSpacingMark: 708 (out of 18105)
EnclosingMark: 5 (out of 221)) // 0488, 0489, A670, A671, A672
ModifierLetter: 3 (out of 4012)
OtherLetter: 2 (out of 784142)
SpacingCombiningMark: 12 (out of 4420)
LineSeparator: 1 (out of 17)
ParagraphSeparator: 1 (out of 17)

Performance changes:

// ICU - HybridGlobalization switched off
String, String HashCode None: 10.1400ms
String, String HashCode IgnoreCase: 10.1412ms

// HybridGlobalization before this PR (incorrect skipped range)
String, String HashCode None: 36.7683ms
String, String HashCode IgnoreCase: 62.3191ms

// HybridGlobalization after this PR (correct skip ranges)
String, String HashCode None: 36.9939ms
String, String HashCode IgnoreCase: 50.0000ms

ToDo:

  • Add tests that go through all Unicodes, check which are the "skippable", then check if they produce same hashCodes for two equal strings, one of which has the char appended. If they are not "skippable" they might but do not have to produce different hashCodes (this PR overskipps).

Answers to possible questions:

Q: Why don't we do one loop only, for skipping and changing case, so that IgnoreCase is not slower than None?
A: ToUpper is localized, so we make a call to JS. If we would call it on each char of the string, we would do n-times call to JS and back. This way we send the full string only once.

Q: Then why don't we move the hashing logic to JS?
A: For that we would need to re-implement the algorithm that is already available in C#. It would be a code duplicate. What's more, all the hashing with None option would need to call to JS, causing even bigger delay.

@ghost
Copy link

ghost commented Jan 22, 2024

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Background:
In #96354 we introduced a mechanism of calculating HashCodes for invariant culture and non-invariant culture with CompareOptions.None and CompareOptions.IgnoreCase. In order to make the invariant HashCode function work the same way as ICU4C hashing, we are skipping some unicodes. The ranges used in the original PR were collected using ConsoleApp on Windows which turned out not to be a correct approach - they were NLS-based ranges.
For WASM the list of skipped unicodes is shorter (5219 instead of ~16k).

Reason for this PR:
The bigger range does not include the whole corrected range.

Skipped codes by UnicodeCategory:

Control: 59 (out of 1105)
Format: 43 (out of 731)
NonSpacingMark: 195 (out of 18105)
EnclosingMark: 5 (out of 221) // 0488, 0489, A670, A671, A672
ModifierLetter: 2 (out of 4012) // 0640, 07FA
SpacingCombiningMark: 4 (out of 4420) // 0F3E, 0F3F, 1CE1, 1CF7
OtherPunctuation: 4 (out of 7004) // 180A, 1CD3, not one char, two : 10F86 (\uD803 \uDC00), 10F87 (\uD803 \uDF87)
OtherLetter: 683 (out of 784142)
OtherNotAssigned: 3581 (out of 24718)
UppercaseLetter: 4 (out of 19159) // 10591 (\uD801\uDC91), 10592 (\uD801\uDC92), 10594 (\uD801\uDC94), 10595 (\uD801\uDC95)
LowercaseLetter: 24 (out of 24565) // 10597 -  105AF Elbasan script characters
OtherNumber: 1 (out of 5100) // 10FC6 (\uD843 \uDFC6), also 2 char unicode
PrivateUse: 614 (out of 108800)

We could skip full categories, producing more collisions. However:

  • doing so for letters: UppercaseLetter and LowercaseLetter means no hashing for natural language words - skipping just the required ranges is not problematic
  • doing so for big categories with only 1-5 chars that should be skipped looks like an overkill.

Performance changes:

// ICU - HybridGlobalization switched off
String, String HashCode None: 10.1400ms
String, String HashCode IgnoreCase: 10.1412ms

// HybridGlobalization before this PR (incorrect skipped range)
String, String HashCode None: 36.7683ms
String, String HashCode IgnoreCase: 62.3191ms

// HybridGlobalization after this PR (correct skip ranges)
String, String HashCode None:  40.1014ms
String, String HashCode IgnoreCase:  66.8652ms
Author: ilonatommy
Assignees: ilonatommy
Labels:

arch-wasm, area-System.Globalization

Milestone: -

for(int codePoint = 0; codePoint < 0x10FFFF; codePoint++)
{
char character = (char)codePoint;
string str2 = $"a{character}b";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just silly idea: is it possible that codepoints are skipped only when are before or after another specific code point ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect only surrogates to work this way. This cast might not work for surrogates, though (they are 2 chars, not one). I need to check it, thanks

// Hybrid has Equal function from JS and hashing from managed invariant algorithm, they might start diverging at some point
[ConditionalTheory(typeof(PlatformDetection), nameof(PlatformDetection.IsHybridGlobalizationOnBrowser))]
[MemberData(nameof(CharsIgnoredByEqualFunction))]
public void CheckHashingOfSkippedChars(int hashCode1, string str2, CompareInfo cmpInfo, CompareOptions options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to also have another test like

foreach locale
    foreach codepoint
        var s1=$"A{codepoint}B"
        var s2=$"AB"
        var h1 = locale.getHash(s1)
        var h2 = locale.getHash(s2)
        if(locale.equals(s1, s2))
             assert(h1 == h2)
        else
             // We know that the hash collisions are OK, when they are rare. So this should fail in very small % of cases.
             assert(h1 != h2)

After we learned how bad the hash collisions are, we could comment out assert(h1 != h2) or add few known collisions as exception to the rules.

@ghost
Copy link

ghost commented Feb 24, 2024

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@pavelsavara
Copy link
Member

@ilonatommy we know that the code in main branch is wrong. Could you please finish this or create open issue describing the problem ?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants