-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update .NET 5 Unicode data to version 13.0.0 #33538
Update .NET 5 Unicode data to version 13.0.0 #33538
Conversation
The following is an incomplete list of types which are affected by this PR (list taken from #2378) since they ultimately rely on the underlying Unicode data.
Other types hich call into the above (examples: See http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html for more information on the changes made to Unicode 13.0. Note that since Unicode 13.0 adds no new blocks to the Basic Multilingual Plane, there are no public API changes required to the existing For letter characters which were introduced into existing blocks in the Basic Multilingual Plane (e.g., |
LGTM. any idea how much increase in the data size? just curious :-) |
GenUnicodeProp run output follows. Looks like around a 320 byte increase, give or take some padding? Unicode 12.1 UCDCategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.
Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2: 97 [ 6208]*
level 3: 690 [11040]
Total: 19424
Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2: 76 [ 4864]*
level 3: 378 [ 6048]
Total: 13088 Unicode 13.0 UCDCategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.
Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2: 98 [ 6272]*
level 3: 699 [11184]
Total: 19632
Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2: 77 [ 4928]*
level 3: 381 [ 6096]
Total: 13200 |
Whoa, that brings back some repressed memories. LGTM. |
@GrabYourPitchforks, should I feel good or bad that no tests had to be modified anywhere? |
@stephentoub I verified that updating the runtime caused unit tests to fail until the test .csproj files were also updated to reference v13.0. So the unit test projects were updated, just not the unit test code. :) (The unit tests in System.Text.Encodings.Web, System.Runtime, and System.Globalization all parse the Unicode files themselves and generate the appropriate test cases on-the-fly, validating that the runtime has the expected behavior.) |
Thanks, understood. What I meant was, you didn't have to change any tests, which means there aren't any tests directly expecting certain values that may have changed here. I'm wondering if we're happy about that or sad about that. |
Got confirmation offline that the third party license file changes are ok. |
@@ -464,6 +464,7 @@ CONST UnicodeDataRec UnicodeData[] = { | |||
{ 0x275, LOWER_CASE, 0x19F }, | |||
{ 0x27D, LOWER_CASE, 0x2C64 }, | |||
{ 0x280, LOWER_CASE, 0x1A6 }, | |||
{ 0x282, LOWER_CASE, 0xA7C5 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichalStrehovsky, could this be a header-only or does adding .cpp in addition to .h file give some advantage? I realize that it is an auto-generated code file, UnicodeData[]
can still can be packed in the header (i.e. .h file can be auto-generated with some glued structs which are currently declared there).
just wondering about your thoughts on .cpp vs. header-only approach in this case. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an advantage of header-only besides having one less file?
I generally prefer .h/.cpp split because long time ago when I did a lot of C++, precompiled headers were a PITA to deal with and from observing where C++ is heading with modules and all, people still didn't figure it out. This is a big data structure to re-parse every time the file is included. I now try to stay away from C++ as much as possible so I might not be up to date.
Force-pushing with a rebase atop b22719b. No code changes since the initial PR other than the rebase. |
479b387
to
9ccbd71
Compare
/azp run runtime |
Azure Pipelines successfully started running 1 pipeline(s). |
Fixes #2378. See that issue for the steps taken to generate these files.
Note to reviewers: This PR is marked as NO MERGE because it's based on top of #33511. Once that PR is committed I can rebase this on top of master, remove the label, and commit. Ignore the changes in the eng/ directory since they ultimately won't be part of this PR. The rest of the PR is ready for review.
@MichalStrehovsky I added you since this PR touches unicodedata.cpp, which you introduced. I ran the tool in that directory against the latest UnicodeData.txt file to regenerate this file's contents. Feel free to review commit 12ef246 in isolation.
@ericstj I added you since I updated the third party copyrights file at the repo root to point to Unicode's new license URL and wanted to make sure everything was ok. Feel free to review commit f9dc373 in isolation.