Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regenerate collation test data #2090

Merged
merged 1 commit into from
Jun 21, 2022
Merged

Conversation

hsivonen
Copy link
Member

@hsivonen hsivonen commented Jun 20, 2022

(Edited.) This is an update from ICU4C 71 branch CI.

@hsivonen hsivonen requested review from a team and echeran as code owners June 20, 2022 08:35
@hsivonen hsivonen added the C-collator Component: Collation, normalization label Jun 20, 2022
@hsivonen hsivonen self-assigned this Jun 20, 2022
@hsivonen
Copy link
Member Author

I have checked that both the full versions of the conformance tests and the disabled-in-repo zh, lt, fi, and sv tests pass with this data update if enabled.

@sffc
Copy link
Member

sffc commented Jun 20, 2022

@echeran Note that this PR also effectively updates the property testdata to Unicode 15.

@echeran
Copy link
Contributor

echeran commented Jun 20, 2022

@echeran Note that this PR also effectively updates the property testdata to Unicode 15.

Good observation. The only property that is being updated is Canonical_Combining_Class, which should be collator-specific. If we merge this PR as-is, then we have one property in v1.0 on Unicode v15.0, and the rest of the properties on v14.0.

On the one hand, the change does seem isolated, but on the other hand, we would have a version inconsistency among the properties in ICU4X v1.0 (because updating properties to Unicode v15.0 is slated for after the ICU4X v1.0 release).

I have no other reservations about this data update PR except the aforementioned version consistency / release timing.

Any thoughts? @hsivonen -- is this a PR that can wait until after the imminent ICU4X v1.0 release (even if you're away and need someone else to click the merge button)?

@hsivonen
Copy link
Member Author

When I made this PR, I thought that ICU4X 1.0 was supposed to ship with Unicode 15 support.

To the extent #2058 is blocked on the ICU4C patch and to the extent we want the collator and normalizer as non-experimental in 1.0, if the ICU4C patch lands on ICU4C trunk only, the full data export from ICU4C CI will be from ICU4C trunk, i.e. Unicode 15. This PR is about test data only either way. If we want Unicode 14-consistent non-test data to be available for the normalizer and the collator, the ICU4C patch needs to be backported to ICU4C from before ICU4C trunk upgraded to Unicode 15.

In that sense, I think leaving this test data patch unmerged doesn't address what non-test data is available as a zip file for non-test use. And on the flip side, to the extent the test data is really just test data, it seems OK to update it in a less synchronized way and the main thing is that the deployment zip file is either all-Unicode 14 or all-Unicode 15.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • experimental/collator/testdata/CollationTest_CLDR_NON_IGNORABLE.txt is no longer changed in the branch
  • experimental/collator/testdata/CollationTest_CLDR_SHIFTED.txt is no longer changed in the branch
  • experimental/collator/testdata/README.md is no longer changed in the branch
  • provider/testdata/data/baked/collator/data_v1.rs is different
  • provider/testdata/data/baked/collator/jamo_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/collator/prim_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/nfc_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/nfd_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/nfkc_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/nfkd_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/uts46_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/normalizer/uts46d_v1.rs is no longer changed in the branch
  • provider/testdata/data/baked/props/ccc_v1.rs is no longer changed in the branch
  • provider/testdata/data/coll/ja_standard_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/root_standard_data.toml is different
  • provider/testdata/data/coll/root_standard_jamo.toml is no longer changed in the branch
  • provider/testdata/data/coll/root_standard_prim.toml is no longer changed in the branch
  • provider/testdata/data/coll/th_standard_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_big5han_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_gb2312han_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_pinyin_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_stroke_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_unihan_data.toml is no longer changed in the branch
  • provider/testdata/data/coll/zh_zhuyin_data.toml is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/ja.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/th.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/und.json is different
  • provider/testdata/data/json/collator/jamo@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/collator/prim@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfc@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfd@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfkc@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfkd@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/uts46@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/uts46d@1/und.json is no longer changed in the branch
  • provider/testdata/data/json/props/ccc@1/und.json is no longer changed in the branch
  • provider/testdata/data/testdata.postcard is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/ccc.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfc.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfd.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfkc.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/nfkd.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/uts46.toml is different
  • provider/testdata/data/uprops/icuexportdata_uprops_full/small/uts46d.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@hsivonen
Copy link
Member Author

Somehow I had managed to be off by three months in my understanding of when Unicode 15 is expected to be released and how it relates to ICU4X 1.0. I have force-pushed a much smaller changeset that syncs the collation and normalization relevant files from an actual (draft PR) CI run from the ICU4C 71 branch.

The non-comment change relates to my tweaks to the root generation using the genuca tool.

Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that SGTM, and it resolves my only issue about property data version consistency. Thanks, everything LGTM.

@hsivonen hsivonen merged commit 02702d3 into unicode-org:main Jun 21, 2022
@hsivonen hsivonen deleted the regencoll branch June 21, 2022 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-collator Component: Collation, normalization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants