Update .NET 5 Unicode data to version 13.0.0 #33538

GrabYourPitchforks · 2020-03-13T00:28:31Z

Fixes #2378. See that issue for the steps taken to generate these files.

Note to reviewers: This PR is marked as NO MERGE because it's based on top of #33511. Once that PR is committed I can rebase this on top of master, remove the label, and commit. Ignore the changes in the eng/ directory since they ultimately won't be part of this PR. The rest of the PR is ready for review.

@MichalStrehovsky I added you since this PR touches unicodedata.cpp, which you introduced. I ran the tool in that directory against the latest UnicodeData.txt file to regenerate this file's contents. Feel free to review commit 12ef246 in isolation.

@ericstj I added you since I updated the third party copyrights file at the repo root to point to Unicode's new license URL and wanted to make sure everything was ok. Feel free to review commit f9dc373 in isolation.

GrabYourPitchforks · 2020-03-13T00:36:52Z

The following is an incomplete list of types which are affected by this PR (list taken from #2378) since they ultimately rely on the underlying Unicode data.

System.Globalization.StringInfo
System.Globalization.CharUnicodeInfo
System.Text.Encodings.Web.*
System.Text.Json.* (since it depends on System.Text.Encodings.Web)

Other types which call into the above (examples: System.Char, System.Uri, System.Text.Rune, System.Text.RegularExpressions.Regex) will also see the new data plumbed through.

See http://blog.unicode.org/2020/03/announcing-unicode-standard-version-130.html for more information on the changes made in Unicode 13.0. Note that since Unicode 13.0 adds no new blocks to the Basic Multilingual Plane, there are no public API changes required to the existing System.Text.Unicode.UnicodeRanges type.

For letter characters which were introduced into existing blocks in the Basic Multilingual Plane (e.g., U+31BD BOPOMOFO LETTER KW), the UnsafeRelaxedJavaScriptEncoder will now detect these as valid characters and allow them to pass through unescaped.
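To make the point about existing blocks concrete, here is a minimal C++ sketch (not code from this PR) that checks that U+31BD sits in the Basic Multilingual Plane and inside a block that already existed. The `BlockRange` type and both helper functions are hypothetical names, and the Bopomofo Extended boundaries U+31A0..U+31BF come from the Unicode block list rather than from this thread.

```cpp
#include <cstdint>

// Hypothetical helper type: an inclusive range of code points.
struct BlockRange
{
    std::uint32_t first;
    std::uint32_t last;
};

// Bopomofo Extended, per the Unicode block list (assumption: not stated in this PR).
constexpr BlockRange kBopomofoExtended = { 0x31A0, 0x31BF };

// The Basic Multilingual Plane covers U+0000..U+FFFF.
constexpr bool IsInBmp(std::uint32_t codePoint)
{
    return codePoint <= 0xFFFF;
}

constexpr bool IsInBlock(std::uint32_t codePoint, BlockRange block)
{
    return codePoint >= block.first && codePoint <= block.last;
}

// U+31BD BOPOMOFO LETTER KW lands in an existing BMP block, so only the
// per-character data changes; no new named range is needed to cover it.
static_assert(IsInBmp(0x31BD) && IsInBlock(0x31BD, kBopomofoExtended),
              "U+31BD should fall inside Bopomofo Extended");
```

The check is purely about block membership; whether an encoder lets the character through depends on the regenerated character data, which this sketch does not model.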

tarekgh · 2020-03-13T00:53:54Z

LGTM. Any idea how much the data size increased? Just curious :-)

GrabYourPitchforks · 2020-03-13T01:14:37Z

GenUnicodeProp run output follows. Looks like around a 320 byte increase, give or take some padding?

Unicode 12.1 UCD

CategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.

Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2:   97 [ 6208]*
level 3:  690 [11040]
Total:         19424

Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2:   76 [ 4864]*
level 3:  378 [ 6048]
Total:         13088

Unicode 13.0 UCD

CategoryCasingMap contains 56 entries.
NumericGraphemeMap contains 177 entries.

Process 11:5:4 table CategoryCasingTable.
level 1: 2176 [ 2176]
level 2:   98 [ 6272]*
level 3:  699 [11184]
Total:         19632

Process 11:5:4 table NumericGraphemeTable.
level 1: 2176 [ 2176]
level 2:   77 [ 4928]*
level 3:  381 [ 6096]
Total:         13200
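The "around 320 bytes" estimate can be reproduced from the figures in the output above. The sketch below is only arithmetic on those printed numbers. The per-block sizes (64 bytes at level 2, 16 bytes at level 3) are inferred by dividing the bracketed byte counts by the block counts, for example 6208 / 97 and 11040 / 690. They are not taken from the GenUnicodeProp source, and `TableBytes` is a hypothetical helper.

```cpp
#include <cstdint>

// Inferred from the tool output, not from the GenUnicodeProp source:
// level 1 is a fixed 2176 bytes (which equals 0x110000 >> 9), and the
// bracketed byte counts imply 64 bytes per level-2 block and 16 bytes
// per level-3 block.
constexpr std::uint32_t kLevel1Bytes = 2176;
constexpr std::uint32_t kLevel2BytesPerBlock = 64;
constexpr std::uint32_t kLevel3BytesPerBlock = 16;

// Total size in bytes of one three-level table.
constexpr std::uint32_t TableBytes(std::uint32_t level2Blocks, std::uint32_t level3Blocks)
{
    return kLevel1Bytes
         + level2Blocks * kLevel2BytesPerBlock
         + level3Blocks * kLevel3BytesPerBlock;
}

// CategoryCasingTable plus NumericGraphemeTable for each UCD version.
constexpr std::uint32_t kUcd121Bytes = TableBytes(97, 690) + TableBytes(76, 378);
constexpr std::uint32_t kUcd130Bytes = TableBytes(98, 699) + TableBytes(77, 381);

static_assert(kUcd130Bytes - kUcd121Bytes == 320,
              "growth from Unicode 12.1 to 13.0 should be 320 bytes");
```

Broken down per table, CategoryCasingTable grows by 208 bytes (19424 to 19632) and NumericGraphemeTable by 112 bytes (13088 to 13200), before any padding.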

MichalStrehovsky · 2020-03-13T11:47:47Z

> @MichalStrehovsky I added you since this PR touches unicodedata.cpp, which you introduced

Whoa, that brings back some repressed memories. LGTM.

stephentoub · 2020-03-13T13:38:50Z

@GrabYourPitchforks, should I feel good or bad that no tests had to be modified anywhere?

GrabYourPitchforks · 2020-03-13T16:07:02Z

@stephentoub I verified that updating the runtime caused unit tests to fail until the test .csproj files were also updated to reference v13.0. So the unit test projects were updated, just not the unit test code. :)

(The unit tests in System.Text.Encodings.Web, System.Runtime, and System.Globalization all parse the Unicode files themselves and generate the appropriate test cases on-the-fly, validating that the runtime has the expected behavior.)
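As an illustration of what "parse the Unicode files themselves" involves, here is a minimal C++ sketch that pulls the code point and general category out of one UnicodeData.txt line. It is not the test code from this PR (those tests are part of the .NET libraries and are not shown in this thread). `UcdEntry`, `HexValue`, and `ParseUnicodeDataLine` are hypothetical names, and the sketch assumes the standard semicolon-separated layout in which field 0 is the code point in uppercase hex and field 2 is the two-letter general category.

```cpp
#include <cstdint>

// Hypothetical record holding the two fields this sketch extracts.
struct UcdEntry
{
    std::uint32_t codePoint;
    char category[3]; // two-letter general category plus terminator
};

// Value of an uppercase hex digit, or -1 if the character is not one.
constexpr int HexValue(char c)
{
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
         : -1;
}

// Parses "CODE;NAME;CATEGORY;..." and returns the code point and category.
// On a malformed line the category is left as an empty string.
constexpr UcdEntry ParseUnicodeDataLine(const char* line)
{
    UcdEntry entry = { 0, { 0, 0, 0 } };
    const char* p = line;

    // Field 0: code point in hex.
    while (HexValue(*p) >= 0)
    {
        entry.codePoint = entry.codePoint * 16 + static_cast<std::uint32_t>(HexValue(*p));
        ++p;
    }
    if (*p != ';') return entry;
    ++p;

    // Field 1: character name, skipped.
    while (*p != '\0' && *p != ';') ++p;
    if (*p != ';') return entry;
    ++p;

    // Field 2: general category, e.g. "Lu" or "Lo".
    if (p[0] != '\0' && p[1] != '\0')
    {
        entry.category[0] = p[0];
        entry.category[1] = p[1];
    }
    return entry;
}
```

A data-driven test would run every line of the file through a parser like this and compare each result with what the runtime reports for the same code point, which is why pointing the test projects at the 13.0 files was enough to update the expectations.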

stephentoub · 2020-03-13T16:12:56Z

Thanks, understood. What I meant was, you didn't have to change any tests, which means there aren't any tests directly expecting certain values that may have changed here. I'm wondering if we're happy about that or sad about that.

GrabYourPitchforks · 2020-03-13T17:02:42Z

Got confirmation offline that the third party license file changes are ok.

am11 · 2020-03-14T03:01:58Z

src/coreclr/src/pal/src/locale/unicodedata.cpp

@@ -464,6 +464,7 @@ CONST UnicodeDataRec UnicodeData[] = {
  { 0x275, LOWER_CASE, 0x19F },
  { 0x27D, LOWER_CASE, 0x2C64 },
  { 0x280, LOWER_CASE, 0x1A6 },
+  { 0x282, LOWER_CASE, 0xA7C5 },


@MichalStrehovsky, could this be header-only, or does adding a .cpp file in addition to the .h file give some advantage? I realize that it is an auto-generated code file, but UnicodeData[] could still be packed into the header (i.e., the .h file could be auto-generated along with the glue structs which are currently declared there).
Just wondering about your thoughts on the .cpp vs. header-only approach in this case. :)

MichalStrehovsky · 2020-03-16

Is there an advantage of header-only besides having one less file?

I generally prefer the .h/.cpp split because a long time ago, when I did a lot of C++, precompiled headers were a PITA to deal with, and from observing where C++ is heading with modules and all, people still haven't figured it out. This is a big data structure to re-parse every time the file is included. I now try to stay away from C++ as much as possible, so I might not be up to date.
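For readers unfamiliar with the table under discussion, the sketch below shows how a sorted case-mapping table shaped like the rows in the diff hunk above can be searched. Only the four rows and the names `UnicodeDataRec`, `UnicodeData`, and `LOWER_CASE` come from that hunk. The field names, the `CaseKind` enum, and the `OpposingCase` helper are hypothetical, and the real PAL structure and lookup code may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical enum; the hunk only shows the LOWER_CASE value in use.
enum CaseKind : std::uint8_t { UPPER_CASE, LOWER_CASE };

// Hypothetical field names for the three columns visible in the hunk.
struct UnicodeDataRec
{
    std::uint16_t character;    // UTF-16 code unit being described
    CaseKind kind;              // case classification of that character
    std::uint16_t opposingCase; // its counterpart in the other case
};

// The four rows from the diff hunk, kept in ascending order.
constexpr UnicodeDataRec UnicodeData[] = {
    { 0x275, LOWER_CASE, 0x19F },
    { 0x27D, LOWER_CASE, 0x2C64 },
    { 0x280, LOWER_CASE, 0x1A6 },
    { 0x282, LOWER_CASE, 0xA7C5 },
};

// Binary search over the sorted table. Returns the opposing-case
// character, or the input unchanged when the table has no entry for it.
constexpr std::uint16_t OpposingCase(std::uint16_t c)
{
    std::size_t lo = 0;
    std::size_t hi = sizeof(UnicodeData) / sizeof(UnicodeData[0]);
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (UnicodeData[mid].character == c) return UnicodeData[mid].opposingCase;
        if (UnicodeData[mid].character < c) lo = mid + 1;
        else hi = mid;
    }
    return c;
}
```

Keeping everything `constexpr` in one file is a convenience for this sketch and takes no side in the header-only versus .h/.cpp question. With the full generated table, the parsing cost per including translation unit raised above would still apply to a header-only layout.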

GrabYourPitchforks · 2020-03-14T03:06:31Z

Force-pushing with a rebase atop b22719b. No code changes since the initial PR other than the rebase.

GrabYourPitchforks · 2020-03-15T03:59:59Z

/azp run runtime

azure-pipelines · 2020-03-15T04:00:31Z

Azure Pipelines successfully started running 1 pipeline(s).

GrabYourPitchforks added the NO-MERGE [The PR is not ready for merge yet (see discussion for detailed reasons)] and area-System.Globalization labels Mar 13, 2020

GrabYourPitchforks added this to the 5.0 milestone Mar 13, 2020

GrabYourPitchforks requested review from layomia, tarekgh, MichalStrehovsky and ericstj March 13, 2020 00:28

tarekgh approved these changes Mar 13, 2020

lpereira reviewed Mar 14, 2020

src/libraries/System.Text.Encodings.Web/tools/updating-encodings.md (resolved)

am11 reviewed Mar 14, 2020

GrabYourPitchforks added the blocked [Issue/PR is blocked on something - see comments] label Mar 14, 2020

GrabYourPitchforks and others added 4 commits March 14, 2020 11:12

6603e86 Update Unicode license file
cc8d639 Update PAL helpers to Unicode 13.0
215d75b Update libraries to Unicode 13.0
9ccbd71 Update minimum required framework to run S.T.E.W tools

GrabYourPitchforks removed the NO-MERGE [The PR is not ready for merge yet (see discussion for detailed reasons)] and blocked [Issue/PR is blocked on something - see comments] labels Mar 14, 2020

GrabYourPitchforks force-pushed the unicode_13 branch from 479b387 to 9ccbd71 Compare March 14, 2020 18:15

dotnet deleted a comment from azure-pipelines bot Mar 15, 2020

GrabYourPitchforks merged commit 30fd787 into dotnet:master Mar 15, 2020

GrabYourPitchforks deleted the unicode_13 branch March 15, 2020 06:53

svick mentioned this pull request May 15, 2020

Update Unicode version for .Net Core dotnet/dotnet-api-docs#4242

Open

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
