Regenerate collation test data #2090

hsivonen · 2022-06-20T08:35:29Z

(Edited.) This is an update from ICU4C 71 branch CI.

hsivonen · 2022-06-20T08:37:37Z

I have checked that both the full versions of the conformance tests and the disabled-in-repo zh, lt, fi, and sv tests pass with this data update if enabled.

sffc · 2022-06-20T17:24:33Z

@echeran Note that this PR also effectively updates the property testdata to Unicode 15.

echeran · 2022-06-20T19:22:39Z

@echeran Note that this PR also effectively updates the property testdata to Unicode 15.

Good observation. The only property that is being updated is Canonical_Combining_Class, which should be collator-specific. If we merge this PR as-is, then we have one property in v1.0 on Unicode v15.0, and the rest of the properties on v14.0.

On the one hand, the change does seem isolated, but on the other hand, we would have a version inconsistency among the properties in ICU4X v1.0 (because updating properties to Unicode v15.0 is slated for after the ICU4X v1.0 release).

I have no other reservations about this data update PR except the aforementioned version consistency / release timing.

Any thoughts? @hsivonen -- is this a PR that can wait until after the imminent ICU4X v1.0 release (even if you're away and need someone else to click the merge button)?

hsivonen · 2022-06-21T05:27:32Z

When I made this PR, I thought that ICU4X 1.0 was supposed to ship with Unicode 15 support.

To the extent #2058 is blocked on the ICU4C patch and to the extent we want the collator and normalizer as non-experimental in 1.0, if the ICU4C patch lands on ICU4C trunk only, the full data export from ICU4C CI will be from ICU4C trunk, i.e. Unicode 15. This PR is about test data only either way. If we want Unicode 14-consistent non-test data to be available for the normalizer and the collator, the ICU4C patch needs to be backported to ICU4C from before ICU4C trunk upgraded to Unicode 15.

In that sense, I think leaving this test data patch unmerged doesn't address what non-test data is available as a zip file for non-test use. And on the flip side, to the extent the test data is really just test data, it seems OK to update it in a less synchronized way and the main thing is that the deployment zip file is either all-Unicode 14 or all-Unicode 15.

jira-pull-request-webhook · 2022-06-21T13:24:51Z

Notice: the branch changed across the force-push!

experimental/collator/testdata/CollationTest_CLDR_NON_IGNORABLE.txt is no longer changed in the branch
experimental/collator/testdata/CollationTest_CLDR_SHIFTED.txt is no longer changed in the branch
experimental/collator/testdata/README.md is no longer changed in the branch
provider/testdata/data/baked/collator/data_v1.rs is different
provider/testdata/data/baked/collator/jamo_v1.rs is no longer changed in the branch
provider/testdata/data/baked/collator/prim_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/nfc_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/nfd_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/nfkc_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/nfkd_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/uts46_v1.rs is no longer changed in the branch
provider/testdata/data/baked/normalizer/uts46d_v1.rs is no longer changed in the branch
provider/testdata/data/baked/props/ccc_v1.rs is no longer changed in the branch
provider/testdata/data/coll/ja_standard_data.toml is no longer changed in the branch
provider/testdata/data/coll/root_standard_data.toml is different
provider/testdata/data/coll/root_standard_jamo.toml is no longer changed in the branch
provider/testdata/data/coll/root_standard_prim.toml is no longer changed in the branch
provider/testdata/data/coll/th_standard_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_big5han_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_gb2312han_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_pinyin_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_stroke_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_unihan_data.toml is no longer changed in the branch
provider/testdata/data/coll/zh_zhuyin_data.toml is no longer changed in the branch
provider/testdata/data/json/collator/data@1/ja.json is no longer changed in the branch
provider/testdata/data/json/collator/data@1/th.json is no longer changed in the branch
provider/testdata/data/json/collator/data@1/und.json is different
provider/testdata/data/json/collator/jamo@1/und.json is no longer changed in the branch
provider/testdata/data/json/collator/prim@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/nfc@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/nfd@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/nfkc@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/nfkd@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/uts46@1/und.json is no longer changed in the branch
provider/testdata/data/json/normalizer/uts46d@1/und.json is no longer changed in the branch
provider/testdata/data/json/props/ccc@1/und.json is no longer changed in the branch
provider/testdata/data/testdata.postcard is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/ccc.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfc.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfd.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfkc.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfkd.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/uts46.toml is different
provider/testdata/data/uprops/icuexportdata_uprops_full/small/uts46d.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

hsivonen · 2022-06-21T13:29:57Z

Somehow I had managed to be off by three months in my understanding of when Unicode 15 is expected to be released and how it relates to ICU4X 1.0. I have force-pushed a much smaller changeset that syncs the collation and normalization relevant files from an actual (draft PR) CI run from the ICU4C 71 branch.

The non-comment change relates to my tweaks to the root generation using the genuca tool.

echeran

Great, that SGTM, and it resolves my only issue about property data version consistency. Thanks, everything LGTM.

hsivonen requested review from a team and echeran as code owners June 20, 2022 08:35

hsivonen added the C-collator Component: Collation, normalization label Jun 20, 2022

hsivonen self-assigned this Jun 20, 2022

Refresh collator and normalizer data from ICU4C 71 branch CI

2ef3a8e

hsivonen force-pushed the regencoll branch from fb8cd9a to 2ef3a8e Compare June 21, 2022 13:24

echeran approved these changes Jun 21, 2022

View reviewed changes

hsivonen merged commit 02702d3 into unicode-org:main Jun 21, 2022

hsivonen deleted the regencoll branch June 21, 2022 19:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regenerate collation test data #2090

Regenerate collation test data #2090

hsivonen commented Jun 20, 2022 •

edited

Loading

hsivonen commented Jun 20, 2022

sffc commented Jun 20, 2022

echeran commented Jun 20, 2022

hsivonen commented Jun 21, 2022

jira-pull-request-webhook bot commented Jun 21, 2022

hsivonen commented Jun 21, 2022

echeran left a comment

Regenerate collation test data #2090

Regenerate collation test data #2090

Conversation

hsivonen commented Jun 20, 2022 • edited Loading

hsivonen commented Jun 20, 2022

sffc commented Jun 20, 2022

echeran commented Jun 20, 2022

hsivonen commented Jun 21, 2022

jira-pull-request-webhook bot commented Jun 21, 2022

hsivonen commented Jun 21, 2022

echeran left a comment

Choose a reason for hiding this comment

hsivonen commented Jun 20, 2022 •

edited

Loading