Support Invariant Mode Case Mapping #55520

Merged on Jul 15, 2021 (10 commits)

tarekgh
Member

@tarekgh tarekgh commented Jul 12, 2021

Fixes #43774

This change adds full support for case mapping in Invariant Globalization Mode. Casing is supported in string and span operations (Compare, IndexOf, LastIndexOf, IsPrefix, IsSuffix, SortKey generation, hash code calculation, ToUpper and ToLower).

To support casing, we must carry some casing data, which increases the size of System.Private.CoreLib by about 6K.

Here is more info about the size increase:

Before

 File size            : 1574400
 PE header size       : 512 (472 used)    ( 0.03%)
 PE additional info   : 1152              ( 0.07%)
 Num.of PE sections   : 2
 CLR header size     : 72                 ( 0.00%)
 CLR meta-data size  : 808396             (51.35%)
 CLR additional info : 163480             (10.38%)
 CLR method headers  : 52913              ( 3.36%)
 Managed code         : 525157            (33.36%)
 Data                 : 1536              ( 0.10%)
 Unaccounted          : 21182             ( 1.35%)

 Num.of PE sections   : 2
   .text    - 1572352
   .rsrc    - 1536

After

 File size            : 1580544
 PE header size       : 512 (472 used)    ( 0.03%)
 PE additional info   : 1156              ( 0.07%)
 Num.of PE sections   : 2
 CLR header size     : 72                 ( 0.00%)
 CLR meta-data size  : 810756             (51.30%)
 CLR additional info : 163760             (10.36%)
 CLR method headers  : 53141              ( 3.36%)
 Managed code         : 527276            (33.36%)
 Data                 : 1536              ( 0.10%)
 Unaccounted          : 22335             ( 1.41%)

 Num.of PE sections   : 2
   .text    - 1578496
   .rsrc    - 1536
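
As a quick sanity check of the "about 6K" figure, the sketch below subtracts the byte counts in the two dumps above. The numbers are copied from the dumps; the dictionary layout and variable names are only illustrative, and rows that did not change (PE header, CLR header, Data) are left out.

```python
# Byte counts copied from the "Before" and "After" dumps above.
# Rows that are identical in both dumps are omitted.
before = {
    "File size": 1574400,
    "PE additional info": 1152,
    "CLR meta-data size": 808396,
    "CLR additional info": 163480,
    "CLR method headers": 52913,
    "Managed code": 525157,
    "Unaccounted": 21182,
}
after = {
    "File size": 1580544,
    "PE additional info": 1156,
    "CLR meta-data size": 810756,
    "CLR additional info": 163760,
    "CLR method headers": 53141,
    "Managed code": 527276,
    "Unaccounted": 22335,
}

# Per-row growth in bytes; the file-size row is the overall total.
growth = {name: after[name] - before[name] for name in before}
total = growth.pop("File size")

for name, delta in growth.items():
    print(f"{name:20} +{delta}")
print(f"total growth: {total} bytes = {total / 1024:.1f} KiB")

# The per-row increases account for the whole file-size increase.
assert sum(growth.values()) == total
```

The total comes to 6144 bytes, which is exactly 6 KiB and matches the growth of the .text section.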

@ghost

ghost commented Jul 12, 2021

Tagging subscribers to this area: @tarekgh, @safern
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #43774

This change adds full support for case mapping in Invariant Globalization Mode. Casing is supported in string and span operations (Compare, IndexOf, LastIndexOf, IsPrefix, IsSuffix, SortKey generation, hash code calculation, ToUpper and ToLower).

To support casing, we must carry some casing data, which increases the size of System.Private.CoreLib by about 13K. I tried to keep the newly introduced casing data as small as I could while keeping casing operations fast, so I generated the data in the form of 8-4-4 tables.
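
For readers unfamiliar with the term, here is a minimal sketch of a generic 8-4-4 multi-stage lookup table. It is not the PR's generated data or its C# code: the function names (`intern`, `build_844`, `lookup`), the delta encoding, and the tiny demo mapping are all assumptions made for illustration. The idea is to split a 16-bit code unit into its high 8 bits, middle 4 bits, and low 4 bits, and to share identical blocks so that the many ranges without case mappings cost almost nothing.

```python
def intern(block, pool, index):
    """Return the position of `block` in `pool`, appending it only if unseen."""
    if block not in index:
        index[block] = len(pool)
        pool.append(block)
    return index[block]


def build_844(mapping):
    """Build three-stage tables from {code_unit: mapped_code_unit}.

    Leaves store deltas (mapped - original, modulo 2**16) so that every
    16-entry block without any mapping collapses into one shared all-zero leaf.
    """
    leaves, leaf_ids = [], {}
    mids, mid_ids = [], {}
    top = []
    for hi in range(256):                      # high 8 bits
        mid_block = []
        for mid in range(16):                  # middle 4 bits
            base = (hi << 8) | (mid << 4)
            leaf = tuple(
                (mapping.get(base + lo, base + lo) - (base + lo)) & 0xFFFF
                for lo in range(16)            # low 4 bits
            )
            mid_block.append(intern(leaf, leaves, leaf_ids))
        top.append(intern(tuple(mid_block), mids, mid_ids))
    return top, mids, leaves


def lookup(tables, cu):
    """Map one UTF-16 code unit using three array reads."""
    top, mids, leaves = tables
    mid_block = mids[top[cu >> 8]]
    leaf = leaves[mid_block[(cu >> 4) & 0xF]]
    return (cu + leaf[cu & 0xF]) & 0xFFFF


# Tiny hand-written demo mapping (not the PR's data): a-z, é, and ё to upper case.
demo = {cu: cu - 32 for cu in range(ord("a"), ord("z") + 1)}
demo[0x00E9] = 0x00C9
demo[0x0451] = 0x0401

tables = build_844(demo)
print(chr(lookup(tables, ord("q"))), chr(lookup(tables, 0x0451)))
print("mid blocks:", len(tables[1]), "leaf blocks:", len(tables[2]))
```

Each lookup costs a fixed three array reads regardless of the character, which is the speed side of the size and speed trade-off described above.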

Here is more info about the size increase:

Before

 File size            : 3787264
 PE header size       : 512 (472 used)    ( 0.01%)
 PE additional info   : 1156              ( 0.03%)
 Num.of PE sections   : 2
 CLR header size     : 72                 ( 0.00%)
 CLR meta-data size  : 2171436            (57.34%)
 CLR additional info : 221152             ( 5.84%)
 CLR method headers  : 126595             ( 3.34%)
 Managed code         : 1192156           (31.48%)
 Data                 : 1536              ( 0.04%)
 Unaccounted          : 72649             ( 1.92%)

 Num.of PE sections   : 2
   .text    - 3785216
   .rsrc    - 1536

After

 File size            : 3801088
 PE header size       : 512 (472 used)    ( 0.01%)
 PE additional info   : 1156              ( 0.03%)
 Num.of PE sections   : 2
 CLR header size     : 72                 ( 0.00%)
 CLR meta-data size  : 2172268            (57.15%)
 CLR additional info : 221152             ( 5.82%)
 CLR method headers  : 126641             ( 3.33%)
 Managed code         : 1193753           (31.41%)
 Data                 : 1536              ( 0.04%)
 Unaccounted          : 83998             ( 2.21%)

 Num.of PE sections   : 2
   .text    - 3799040
   .rsrc    - 1536
Author: tarekgh
Assignees: -
Labels: area-System.Globalization

Milestone: -

@tarekgh
Member Author

tarekgh commented Jul 12, 2021

@marek-safar @eerhardt @jkotas @safern could you please have a look? I'd appreciate your feedback soon, if possible, so this can make the deadline.

CC @danmoseley @ericstj

@tarekgh tarekgh self-assigned this Jul 12, 2021
@tarekgh tarekgh added this to the 6.0.0 milestone Jul 12, 2021
@GrabYourPitchforks
Member

GrabYourPitchforks commented Jul 12, 2021

Our existing ICU call sites special-case the Turkish I. We should consider removing that special-case or adding it here as well so that the invariant-mode casing tables match the non-invariant-mode casing tables.

Edit: looking at the bottom of the file, there appears to be some special-casing for the Latin Long S as well?
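
To make the Turkish I concern concrete, here is a small sketch. It assumes that "special-casing" means leaving U+0130 and U+0131 unchanged under invariant casing, which this comment does not spell out. The function names are hypothetical, and Python's built-in `str.upper`/`str.lower` (restricted to one-to-one results) stand in for the real casing tables.

```python
DOTLESS_I = "\u0131"           # LATIN SMALL LETTER DOTLESS I
CAPITAL_I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def one_to_one(ch, mapped):
    """Keep only single-character results; otherwise leave ch unchanged."""
    return mapped if len(mapped) == 1 else ch


def to_upper_invariant(ch):
    # Assumed special case: dotless i does not map to ASCII 'I'.
    if ch == DOTLESS_I:
        return ch
    return one_to_one(ch, ch.upper())


def to_lower_invariant(ch):
    # Assumed special case: dotted capital I does not map to ASCII 'i'.
    if ch == CAPITAL_I_WITH_DOT:
        return ch
    return one_to_one(ch, ch.lower())


for ch in ("i", "I", DOTLESS_I, CAPITAL_I_WITH_DOT):
    print(f"U+{ord(ch):04X}: upper={to_upper_invariant(ch)!r} lower={to_lower_invariant(ch)!r}")
```

Whichever rule is chosen, the point of the comment is that the same rule should hold in both invariant and non-invariant mode.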

@GrabYourPitchforks
Member

I'm glad to see this work come together! 😃

Heads up that IMO this is a substantive breaking change and deserves a callout in the .NET 6 breaking change docs. Also deserves an email blast to the internal breaking change notifications list.

@tarekgh
Member Author

tarekgh commented Jul 12, 2021

> Edit: looking at the bottom of the file, there appears to be some special-casing for the Latin Long S as well?

I already did that when generating the data in the tool.
https://github.com/tarekgh/runtime/blob/5db42df43e28fc9662e333abed369167dadd2050/src/coreclr/System.Private.CoreLib/Tools/InvariantCasing/Program.cs#L57

@tarekgh
Member Author

tarekgh commented Jul 12, 2021

> Heads up that IMO this is a substantive breaking change and deserves a callout in the .NET 6 breaking change docs. Also deserves an email blast to the internal breaking change notifications list.

Interesting, I hadn't thought about it that way :-) Invariant mode is restrictive, and I am not sure anyone depends on the old behavior. But it wouldn't hurt to file a breaking change issue. I'll do that. Thanks for pointing that out.

@tarekgh
Member Author

tarekgh commented Jul 13, 2021

@GrabYourPitchforks your suggestion to use the Unicode category data to generate the casing tables gives a much better size result. The whole increase is now around 6K, compared to 13K before applying your suggestion. Thanks a lot.
I kept using the Unicode 13 data; we can do the full upgrade to 14 in all areas later.

@tarekgh
Member Author

tarekgh commented Jul 13, 2021

Now everything is ready. I'd appreciate it if you could review this soon so I can try to get it in before we snap. Thanks!

tarekgh and others added 2 commits July 13, 2021 13:30
…ryCasingInfo.cs

Co-authored-by: Santiago Fernandez Madero <safern@microsoft.com>
…CharUnicodeInfo.cs

Co-authored-by: Santiago Fernandez Madero <safern@microsoft.com>
@tarekgh
Member Author

tarekgh commented Jul 13, 2021

I have updated the breaking change issue dotnet/docs#24849 to include the case mapping changes we are doing here.

@tarekgh
Member Author

tarekgh commented Jul 14, 2021

Any other comments, or are we good to go? I hope I can get this merged by tomorrow. Thanks for all the feedback so far :-)

@GrabYourPitchforks
Member

Plan on finalizing the review for this tomorrow. Thanks for the patience! :)

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Tentative approval, subject to:

  • Take a look at the sorting behavior w.r.t. when one string contains a surrogate pair and the other string doesn't, and double-check that the as-implemented behavior is the behavior we want.

  • See comments in SurrogateCasing.cs and determine whether that method should check its inputs or whether the caller should be more defensive about not passing in bad data.

  • Everything else is perf-related and is at your discretion. The logic otherwise LGTM!

Thanks so much for tackling this, Tarek! :)
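
To illustrate why the surrogate pair question in the first bullet matters, here is a minimal sketch. It does not show which order the PR implements; it only demonstrates that comparing by UTF-16 code unit and comparing by code point can disagree when exactly one string contains a surrogate pair. The helper names and the two sample characters are chosen for illustration.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def compare_code_units(a, b):
    """Three-way comparison (-1, 0, 1) by UTF-16 code unit."""
    ua, ub = utf16_units(a), utf16_units(b)
    return (ua > ub) - (ua < ub)


def compare_code_points(a, b):
    """Three-way comparison (-1, 0, 1) by Unicode code point."""
    pa, pb = [ord(c) for c in a], [ord(c) for c in b]
    return (pa > pb) - (pa < pb)


bmp = "\uFF5E"       # FULLWIDTH TILDE, a single code unit above the surrogate range
supp = "\U00010400"  # DESERET CAPITAL LETTER LONG I, encoded as D801 DC00

print([hex(u) for u in utf16_units(supp)])
print("code units :", compare_code_units(bmp, supp))
print("code points:", compare_code_points(bmp, supp))
```

By code unit, U+FF5E sorts after U+10400 because 0xFF5E is greater than the high surrogate 0xD801; by code point, it sorts before. Any BMP character in the range U+E000 to U+FFFF shows the same disagreement.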

@tarekgh
Member Author

tarekgh commented Jul 15, 2021

The failure in the "Libraries Test Run release coreclr windows x86 Debug" leg is tracked by issue #55715.

The rest of the CI runs already succeeded, but it looks like that was not reflected in GitHub. I am merging it.

@tarekgh tarekgh merged commit 1b14c94 into dotnet:main Jul 15, 2021
@tarekgh tarekgh deleted the SupportInvariantModeCaseMapping branch July 15, 2021 06:05
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2021
Linked issue: Improve invariant globalization mode to be more complete