Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update .NET 5 Unicode data to version 13.0.0 #33538

Merged
merged 4 commits into from
Mar 15, 2020

Conversation

GrabYourPitchforks
Copy link
Member

Fixes #2378. See that issue for the steps taken to generate these files.

Note to reviewers: This PR is marked as NO MERGE because it's based on top of #33511. Once that PR is committed I can rebase this on top of master, remove the label, and commit. Ignore the changes in the eng/ directory since they ultimately won't be part of this PR. The rest of the PR is ready for review.

@MichalStrehovsky I added you since this PR touches unicodedata.cpp, which you introduced. I ran the tool in that directory against the latest UnicodeData.txt file to regenerate this file's contents. Feel free to review commit 12ef246 in isolation.

@ericstj I added you since I updated the third party copyrights file at the repo root to point to Unicode's new license URL and wanted to make sure everything was ok. Feel free to review commit f9dc373 in isolation.

@GrabYourPitchforks GrabYourPitchforks added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Globalization labels Mar 13, 2020
@GrabYourPitchforks GrabYourPitchforks added this to the 5.0 milestone Mar 13, 2020
@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 13, 2020

The following is an incomplete list of types which are affected by this PR (list taken from #2378) since they ultimately rely on the underlying Unicode data.

  • System.Globalization.StringInfo
  • System.Globalization.CharUnicodeInfo
  • System.Text.Encodings.Web.*
  • System.Text.Json.* (since it depends on System.Text.Encodings.Web)

Other types hich call into the above (examples: System.Char, System.Uri, System.Text.Rune, System.RegularExpressions.Regex) will also see the new data plumbed through.

See http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html for more information on the changes made to Unicode 13.0. Note that since Unicode 13.0 adds no new blocks to the Basic Multilingual Plane, there are no public API changes required to the existing System.Text.Unicode.UnicodeRanges type.

For letter characters which were introduced into existing blocks in the Basic Multilingual Plane (e.g., U+31BD BOPOMOFO LETTER KW), the UnsafeRelaxedJavaScriptEncoder will now detect these as valid characters and allow them to pass through unescaped.

@tarekgh
Copy link
Member

tarekgh commented Mar 13, 2020

LGTM. any idea how much increase in the data size? just curious :-)

@GrabYourPitchforks
Copy link
Member Author

GenUnicodeProp run output follows. Looks like around a 320 byte increase, give or take some padding?

Unicode 12.1 UCD

CategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.

Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2:   97 [ 6208]*
level 3:  690 [11040]
Total:         19424

Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2:   76 [ 4864]*
level 3:  378 [ 6048]
Total:         13088

Unicode 13.0 UCD

CategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.

Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2:   98 [ 6272]*
level 3:  699 [11184]
Total:         19632

Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2:   77 [ 4928]*
level 3:  381 [ 6096]
Total:         13200

@MichalStrehovsky
Copy link
Member

@MichalStrehovsky I added you since this PR touches unicodedata.cpp, which you introduced

Whoa, that brings back some repressed memories. LGTM.

@stephentoub
Copy link
Member

@GrabYourPitchforks, should I feel good or bad that no tests had to be modified anywhere?

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 13, 2020

@stephentoub I verified that updating the runtime caused unit tests to fail until the test .csproj files were also updated to reference v13.0. So the unit test projects were updated, just not the unit test code. :)

(The unit tests in System.Text.Encodings.Web, System.Runtime, and System.Globalization all parse the Unicode files themselves and generate the appropriate test cases on-the-fly, validating that the runtime has the expected behavior.)

@stephentoub
Copy link
Member

Thanks, understood. What I meant was, you didn't have to change any tests, which means there aren't any tests directly expecting certain values that may have changed here. I'm wondering if we're happy about that or sad about that.

@GrabYourPitchforks
Copy link
Member Author

Got confirmation offline that the third party license file changes are ok.

@@ -464,6 +464,7 @@ CONST UnicodeDataRec UnicodeData[] = {
{ 0x275, LOWER_CASE, 0x19F },
{ 0x27D, LOWER_CASE, 0x2C64 },
{ 0x280, LOWER_CASE, 0x1A6 },
{ 0x282, LOWER_CASE, 0xA7C5 },
Copy link
Member

@am11 am11 Mar 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichalStrehovsky, could this be a header-only or does adding .cpp in addition to .h file give some advantage? I realize that it is an auto-generated code file, UnicodeData[] can still can be packed in the header (i.e. .h file can be auto-generated with some glued structs which are currently declared there).
just wondering about your thoughts on .cpp vs. header-only approach in this case. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an advantage of header-only besides having one less file?

I generally prefer .h/.cpp split because long time ago when I did a lot of C++, precompiled headers were a PITA to deal with and from observing where C++ is heading with modules and all, people still didn't figure it out. This is a big data structure to re-parse every time the file is included. I now try to stay away from C++ as much as possible so I might not be up to date.

@GrabYourPitchforks GrabYourPitchforks added the blocked Issue/PR is blocked on something - see comments label Mar 14, 2020
@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 14, 2020

Force-pushing with a rebase atop b22719b. No code changes since the initial PR other than the rebase.

@GrabYourPitchforks GrabYourPitchforks removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) blocked Issue/PR is blocked on something - see comments labels Mar 14, 2020
@GrabYourPitchforks
Copy link
Member Author

/azp run runtime

@dotnet dotnet deleted a comment from azure-pipelines bot Mar 15, 2020
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@GrabYourPitchforks GrabYourPitchforks merged commit 30fd787 into dotnet:master Mar 15, 2020
@GrabYourPitchforks GrabYourPitchforks deleted the unicode_13 branch March 15, 2020 06:53
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update .NET 5 Unicode data to version 13.0.0
6 participants