[browser] `HybridGlobalization` correct `HashCode` ranges of skipped unicodes #97351

ilonatommy · 2024-01-22T22:40:30Z

Background:

In #96354 we introduced a mechanism for calculating HashCodes for the invariant culture and non-invariant cultures with CompareOptions.None and CompareOptions.IgnoreCase. To keep the invariant HashCode function in line with the JS equality function (localeCompare), we skip some Unicode code points. The ranges used in the original PR were collected with a ConsoleApp on Windows, which turned out to be the wrong approach: they were NLS-based ranges.
For the browser (the list for v8-based browsers is the same as for Firefox) the list of skipped code points is shorter (1826 instead of ~16k).
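
To illustrate why the skipped code points matter, here is a minimal sketch, not the runtime's implementation. It assumes a tiny, hypothetical skip table (`SkippedRanges`) and uses FNV-1a as a stand-in hash; the names `SkipHashSketch`, `IsSkipped` and `GetHashSkipping` are made up for this example. The point is only that a code point ignored by the equality function must also be ignored when hashing, otherwise two "equal" strings get different hash codes.

```csharp
using System;

public static class SkipHashSketch
{
    // Hypothetical, deliberately tiny skip table (inclusive ranges).
    // U+0488..U+0489 and U+A670..U+A672 are the EnclosingMark entries listed in this PR;
    // U+0000..U+0008 is an assumed sample from the Control category.
    private static readonly (int Start, int End)[] SkippedRanges =
    {
        (0x0000, 0x0008),
        (0x0488, 0x0489),
        (0xA670, 0xA672),
    };

    public static bool IsSkipped(int codePoint)
    {
        foreach (var (start, end) in SkippedRanges)
        {
            if (codePoint >= start && codePoint <= end)
                return true;
        }
        return false;
    }

    // FNV-1a (a stand-in, not the runtime's algorithm) over the UTF-16
    // code units of every code point that is not skipped.
    public static int GetHashSkipping(string s)
    {
        uint hash = 2166136261;
        for (int i = 0; i < s.Length; i++)
        {
            // A supplementary code point occupies two chars (a surrogate pair).
            int width = char.IsSurrogatePair(s, i) ? 2 : 1;
            int codePoint = width == 2 ? char.ConvertToUtf32(s[i], s[i + 1]) : s[i];

            if (!IsSkipped(codePoint))
            {
                for (int j = 0; j < width; j++)
                    hash = unchecked((hash ^ s[i + j]) * 16777619);
            }
            i += width - 1; // step over the low surrogate, if any
        }
        return unchecked((int)hash);
    }
}
```

With this sketch, `GetHashSkipping("a\u0488b")` and `GetHashSkipping("ab")` return the same value because U+0488 is in the skip table. If the table misses a code point that the JS comparison ignores, the two hashes diverge, which is the bug this PR addresses.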

Reason for this PR:

The bigger (NLS-based) range is not a superset of the corrected range, so some code points that the browser skips were not being skipped.

Skipped codes by UnicodeCategory:

Control: 65 (out of 1105)
SpaceSeparator: 17 (out of 289)
OtherPunctuation: 628 (out of 7004)
OpenPunctuation: 79 (out of 1343)
ClosePunctuation: 77 (out of 1309)
DashPunctuation: 26 (out of 425)
ConnectorPunctuation: 10 (out of 170)
InitialQuotePunctuation: 12 (out of 204)
Format: 170 (out of 731)
FinalQuotePunctuation: 10 (out of 170)
NonSpacingMark: 708 (out of 18105)
EnclosingMark: 5 (out of 221) // 0488, 0489, A670, A671, A672
ModifierLetter: 3 (out of 4012)
OtherLetter: 2 (out of 784142)
SpacingCombiningMark: 12 (out of 4420)
LineSeparator: 1 (out of 17)
ParagraphSeparator: 1 (out of 17)

Performance changes:

// ICU - HybridGlobalization switched off
String, String HashCode None: 10.1400ms
String, String HashCode IgnoreCase: 10.1412ms

// HybridGlobalization before this PR (incorrect skipped range)
String, String HashCode None: 36.7683ms
String, String HashCode IgnoreCase: 62.3191ms

// HybridGlobalization after this PR (correct skip ranges)
String, String HashCode None: 36.9939ms
String, String HashCode IgnoreCase: 50.0000ms

ToDo:

Add tests that go through all Unicode code points, determine which ones are "skippable", and check that two equal strings, one of which has the skippable char appended, produce the same hashCode. For chars that are not "skippable", the hashCodes may but do not have to differ (this PR over-skips).

Answers to possible questions:

Q: Why don't we use a single loop for both skipping and changing case, so that IgnoreCase is not slower than None?
A: ToUpper is localized, so we make a call to JS. If we called it on each char of the string, we would make n round trips to JS and back. This way we send the full string only once.

Q: Then why don't we move the hashing logic to JS?
A: For that we would need to re-implement in JS the algorithm that is already available in C#, which would be code duplication. What's more, all hashing with the None option would then need to call into JS as well, causing an even bigger delay.
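
The batching argument in the first answer can be sketched as follows. This is only an illustration: `ToUpperViaInterop` is a hypothetical stand-in for the localized JS call (here it just uses invariant casing), and `InteropCalls` counts how often the simulated boundary is crossed.

```csharp
using System;
using System.Text;

public static class InteropBatchingSketch
{
    // Counts how often the (simulated) managed-to-JS boundary is crossed.
    public static int InteropCalls;

    // Hypothetical stand-in for the localized JS ToUpper call.
    public static string ToUpperViaInterop(string s)
    {
        InteropCalls++;
        return s.ToUpperInvariant();
    }

    // One boundary crossing per char: n calls for a string of length n.
    public static string UpperPerChar(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
            sb.Append(ToUpperViaInterop(c.ToString()));
        return sb.ToString();
    }

    // One boundary crossing for the whole string.
    public static string UpperWholeString(string s) => ToUpperViaInterop(s);
}
```

For a 5-char input, `UpperPerChar` increments `InteropCalls` five times while `UpperWholeString` increments it once. That is the trade-off described above: a separate pass over the string in exchange for a single round trip.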

ghost · 2024-01-22T22:40:36Z

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Background:
In #96354 we introduced a mechanism for calculating HashCodes for the invariant culture and non-invariant cultures with CompareOptions.None and CompareOptions.IgnoreCase. To make the invariant HashCode function work the same way as ICU4C hashing, we skip some Unicode code points. The ranges used in the original PR were collected with a ConsoleApp on Windows, which turned out to be the wrong approach: they were NLS-based ranges.
For WASM the list of skipped code points is shorter (5219 instead of ~16k).

Reason for this PR:
The bigger range does not include the whole corrected range.

Skipped codes by UnicodeCategory:

Control: 59 (out of 1105)
Format: 43 (out of 731)
NonSpacingMark: 195 (out of 18105)
EnclosingMark: 5 (out of 221) // 0488, 0489, A670, A671, A672
ModifierLetter: 2 (out of 4012) // 0640, 07FA
SpacingCombiningMark: 4 (out of 4420) // 0F3E, 0F3F, 1CE1, 1CF7
OtherPunctuation: 4 (out of 7004) // 180A, 1CD3, and two that take two chars each: 10F86 (\uD803\uDF86), 10F87 (\uD803\uDF87)
OtherLetter: 683 (out of 784142)
OtherNotAssigned: 3581 (out of 24718)
UppercaseLetter: 4 (out of 19159) // 10591 (\uD801\uDD91), 10592 (\uD801\uDD92), 10594 (\uD801\uDD94), 10595 (\uD801\uDD95)
LowercaseLetter: 24 (out of 24565) // 10597 - 105AF Vithkuqi script characters
OtherNumber: 1 (out of 5100) // 10FC6 (\uD803\uDFC6), also a 2-char code point
PrivateUse: 614 (out of 108800)

We could skip full categories, producing more collisions. However:

- doing so for letters (UppercaseLetter and LowercaseLetter) would mean no hashing for natural-language words; skipping just the required ranges is not problematic
- doing so for big categories in which only 1-5 chars should be skipped looks like overkill.

Performance changes:

// ICU - HybridGlobalization switched off
String, String HashCode None: 10.1400ms
String, String HashCode IgnoreCase: 10.1412ms

// HybridGlobalization before this PR (incorrect skipped range)
String, String HashCode None: 36.7683ms
String, String HashCode IgnoreCase: 62.3191ms

// HybridGlobalization after this PR (correct skip ranges)
String, String HashCode None: 40.1014ms
String, String HashCode IgnoreCase: 66.8652ms

Author:	ilonatommy
Assignees:	ilonatommy
Labels:	`arch-wasm`, `area-System.Globalization`
Milestone:	-

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.WebAssembly.cs

pavelsavara · 2024-01-25T10:25:44Z

...ies/System.Runtime/tests/System.Globalization.Tests/CompareInfo/CompareInfoTests.HashCode.cs

+                for (int codePoint = 0; codePoint < 0x10FFFF; codePoint++)
+                {
+                    char character = (char)codePoint;
+                    string str2 = $"a{character}b";


Just a silly idea: is it possible that code points are skipped only when they come before or after another specific code point?

I would expect only surrogates to work this way. This cast might not work for surrogates, though (they are 2 chars, not one). I need to check it, thanks
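
A minimal sketch of how the loop above could build its test strings without the `(char)` cast. The class and method names (`CodePointStrings`, `Wrap`) are hypothetical; `char.ConvertFromUtf32` is the standard API that returns one char for BMP code points and a surrogate pair for supplementary ones.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointStrings
{
    // Yields "a{codePoint}b" for every Unicode scalar value in [first, last].
    public static IEnumerable<string> Wrap(int first, int last)
    {
        for (int codePoint = first; codePoint <= last; codePoint++)
        {
            // U+D800..U+DFFF are surrogate code points, not scalar values;
            // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for them.
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                continue;

            yield return "a" + char.ConvertFromUtf32(codePoint) + "b";
        }
    }
}
```

For a variable `int cp = 0x10F86`, the cast `(char)cp` keeps only the low 16 bits and yields U+0F86, whereas `char.ConvertFromUtf32(cp)` returns the two-char string `"\uD803\uDF86"`. Note also that the bound `codePoint < 0x10FFFF` in the diff stops one short of the last code point; the sketch uses an inclusive upper bound.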

pavelsavara · 2024-01-25T10:37:53Z

...ies/System.Runtime/tests/System.Globalization.Tests/CompareInfo/CompareInfoTests.HashCode.cs

+        // Hybrid has Equal function from JS and hashing from managed invariant algorithm, they might start diverging at some point
+        [ConditionalTheory(typeof(PlatformDetection), nameof(PlatformDetection.IsHybridGlobalizationOnBrowser))]
+        [MemberData(nameof(CharsIgnoredByEqualFunction))]
+        public void CheckHashingOfSkippedChars(int hashCode1, string str2, CompareInfo cmpInfo, CompareOptions options)


I think it would be good to also have another test like

    foreach locale
      foreach codepoint
        var s1 = $"A{codepoint}B"
        var s2 = $"AB"
        var h1 = locale.getHash(s1)
        var h2 = locale.getHash(s2)
        if (locale.equals(s1, s2))
          assert(h1 == h2)
        else
          // We know that hash collisions are OK when they are rare,
          // so this should fail in a very small % of cases.
          assert(h1 != h2)

Once we have learned how bad the hash collisions are, we could comment out assert(h1 != h2) or add a few known collisions as exceptions to the rule.
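
The proposed check could look roughly like the sketch below. It is an assumption-laden illustration, not the test that was added: `HashConsistencySketch` and `FindViolations` are hypothetical names, the equality and hash functions are passed in as delegates so the check can be tried against any implementation, and only the mandatory half of the contract (equal strings must hash equally) is enforced.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class HashConsistencySketch
{
    // Returns the code points cp for which equals("A{cp}B", "AB") holds
    // but the two strings hash differently.
    public static List<int> FindViolations(
        int first, int last,
        Func<string, string, bool> equals,
        Func<string, int> hash)
    {
        var violations = new List<int>();
        int baseline = hash("AB");
        for (int cp = first; cp <= last; cp++)
        {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                continue; // lone surrogates are not valid scalar values

            string s1 = "A" + char.ConvertFromUtf32(cp) + "B";
            if (equals(s1, "AB") && hash(s1) != baseline)
                violations.Add(cp);
        }
        return violations;
    }

    // Convenience overload that plugs in a real culture's CompareInfo.
    public static List<int> FindViolations(
        CultureInfo culture, CompareOptions options, int first, int last)
    {
        CompareInfo ci = culture.CompareInfo;
        return FindViolations(first, last,
            (a, b) => ci.Compare(a, b, options) == 0,
            s => ci.GetHashCode(s, options));
    }
}
```

Calling `FindViolations(CultureInfo.InvariantCulture, CompareOptions.None, 0, 0x10FFFF)` and expecting an empty list would cover the `assert(h1 == h2)` branch for one locale. The `assert(h1 != h2)` branch is left out on purpose, since collisions between unequal strings are permitted.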

ghost · 2024-02-24T11:01:08Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

dotnet-policy-service · 2024-04-01T22:14:40Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

dotnet-policy-service · 2024-05-02T11:40:47Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

pavelsavara · 2024-05-06T11:10:37Z

@ilonatommy we know that the code in the main branch is wrong. Could you please finish this or open an issue describing the problem?

ilonatommy added 2 commits January 22, 2024 22:26

Add perf measurements.

c100191

Correct ranges.

4a4d284

ilonatommy added arch-wasm WebAssembly architecture area-System.Globalization labels Jan 22, 2024

ilonatommy requested review from matouskozak and mkhamoyan January 22, 2024 22:40

ilonatommy self-assigned this Jan 22, 2024

ilonatommy requested review from lewing and pavelsavara as code owners January 22, 2024 22:40

build-analysis bot mentioned this pull request Jan 23, 2024

Tracking issue for "WORKLOAD TIMED OUT" #90309

Closed

matouskozak reviewed Jan 23, 2024

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.WebAssembly.cs (outdated, resolved)

ilonatommy requested a review from matouskozak January 23, 2024 11:08

Feedback.

bb39070

matouskozak approved these changes Jan 23, 2024

ilonatommy marked this pull request as draft January 23, 2024 11:50

build-analysis bot mentioned this pull request Jan 23, 2024

Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed #97049

Closed

Apply v8 / spidermonkey skippable unicodes.

da34518

This was referenced Jan 24, 2024

Tracking issue for CI build timeouts #76454

Closed

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

ilonatommy added 2 commits January 25, 2024 10:11

Slower version of tests (one icall for each test case).

5da154d

Merge branch 'main' into correct-hashcode-skip-ranges

029c6fb

pavelsavara reviewed Jan 25, 2024

ghost closed this Feb 24, 2024

pavelsavara reopened this Mar 2, 2024

dotnet-policy-service bot closed this Apr 1, 2024

ilonatommy reopened this Apr 2, 2024

dotnet-policy-service bot closed this May 2, 2024

ilonatommy mentioned this pull request May 6, 2024

[browser] HybridGlobalization correct HashCode ranges of skipped unicodes #101912

Closed

ilonatommy mentioned this pull request May 31, 2024

[browser] HybridGlobalization correct HashCode ranges of skipped unicodes #102912

Merged

github-actions bot locked and limited conversation to collaborators Jun 27, 2024
