Add a collator implementation under experimental/ #1706

Merged
hsivonen merged 32 commits into unicode-org:main on May 20, 2022

Member

@hsivonen hsivonen commented Mar 17, 2022

I'd like to land the current collator work in progress under experimental/, so that it would participate in provider refactorings without having to catch up on a branch.

What works

  • &str (guaranteed-valid UTF-8), &[u8] (potentially-invalid UTF-8), and &[u16] (potentially-invalid UTF-16) comparison.
  • Passes CollationTest_CLDR_NON_IGNORABLE.txt and CollationTest_CLDR_SHIFTED.txt
    • Tests with unpaired surrogates are skipped, because this implementation treats UTF errors as U+FFFD instead of trying to treat errors as having collation semantics.
  • Loading collation tailorings and default settings (excluding reordering) by locale without alias resolution.
  • Passes a bunch of language-specific tests ported from ICU4C.
  • Discovering the available locales and variants by listing the TOML files in a given input directory.
  • Backward second level (Canadian French).
  • Identical (fifth) level.
  • Script reordering.
  • Loading a collation by variant.
  • Lithuanian diacritics.
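
To illustrate the "UTF errors as U+FFFD" behavior mentioned in the list above, here is a minimal sketch that uses only the Rust standard library. It is not the collator's own decoding code; the helper names `scalars_from_utf8` and `scalars_from_utf16` are hypothetical and exist only for this example.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper: decodes potentially-invalid UTF-8,
/// substituting U+FFFD for each malformed sequence.
fn scalars_from_utf8(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}

/// Hypothetical helper: decodes potentially-invalid UTF-16,
/// substituting U+FFFD for each unpaired surrogate.
fn scalars_from_utf16(units: &[u16]) -> Vec<char> {
    decode_utf16(units.iter().copied())
        .map(|unit| unit.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

With this approach an unpaired surrogate such as 0xD800 has no collation semantics of its own: it becomes U+FFFD before any comparison happens, which is consistent with the surrogate test cases being skipped.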

What doesn't work

  • Byte-wise skipping over an identical prefix. (This is an optimization that's out of scope for now.)
  • Jamo in search collations. (When designing this, I mistakenly assumed that the jamo CE32s would be self-contained. They aren't in search tailorings. Instead, they refer to expansion data. Since the Latin mini expansion bit space is unused, it could probably be repurposed for archaic jamo mini expansions.)
  • Provider capabilities
    • Resolving locales that are aliases. (This includes lack of support for even resolving en as und.)
    • Falling back to a less specific locale (e.g. from fi-FI to fi, from fi-u-co-bogus to fi, from sv-u-co-bogus to the equivalent of sv-u-co-reformed, from zh-u-co-bogus to the equivalent of zh-u-co-pinyin).
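
As a rough illustration of the missing fallback behavior, the following sketch builds a candidate chain from most to least specific using plain string handling. The function `fallback_chain` is hypothetical and does not correspond to any ICU4X API. It also does not model the data-driven step of mapping an unknown variant to a locale's default variant (for example sv to reformed or zh to pinyin).

```rust
/// Hypothetical helper: lists candidate locales from most to least specific.
fn fallback_chain(locale: &str) -> Vec<String> {
    let mut chain = vec![locale.to_string()];
    // Drop the -u- extension (and everything after it) first.
    let mut base = match locale.find("-u-") {
        Some(idx) => &locale[..idx],
        None => locale,
    };
    if base != locale {
        chain.push(base.to_string());
    }
    // Then peel off subtags from the right, e.g. fi-FI becomes fi.
    while let Some(idx) = base.rfind('-') {
        base = &base[..idx];
        chain.push(base.to_string());
    }
    // Finish with the root locale unless we are already there.
    if base != "und" {
        chain.push("und".to_string());
    }
    chain
}
```

A provider could then try each candidate in order and use the first one for which data exists.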

Post-landing help would be particularly welcome on:

  • Consolidating the near-identical providers.
  • Deciding whether script reordering should be overridable via the API or only implied by the locale. (Per meeting: No.)
  • Deciding how variants are requested, whether "search" usage should be a separate flag, and, if so, what to do with searchlj. (Per meeting: Via -u-co- in the Locale.)
  • Designing the alias handling mechanism. It needs to be able to expand region/script to variant. Most notably, zh-HK, zh-MO, zh-TW, zh-Hant, zh-Hant-HK, zh-Hant-MO, zh-Hant-TW to zh with variant stroke.
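
One possible shape for the alias mechanism is a lookup that expands region and script subtags into a collation variant. This is only a sketch: `resolve_alias` is a hypothetical name, the table is hard-coded for illustration, and a real design would presumably be data-driven. The `zh-u-co-stroke` spelling assumes variants are requested via -u-co- as noted above.

```rust
/// Hypothetical alias table: expands region/script subtags to a collation variant.
fn resolve_alias(locale: &str) -> &str {
    match locale {
        // Traditional Chinese locales map to zh with the stroke variant.
        "zh-HK" | "zh-MO" | "zh-TW" | "zh-Hant" | "zh-Hant-HK" | "zh-Hant-MO"
        | "zh-Hant-TW" => "zh-u-co-stroke",
        // English uses the root collation.
        "en" => "und",
        // Anything else is passed through unchanged.
        other => other,
    }
}
```

Alias resolution of this kind would run before any fallback to a less specific locale, so that zh-TW does not silently fall back to plain zh with its default variant.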

(I've edited this text to reflect the code updates.)

@hsivonen hsivonen requested review from sffc and a team as code owners March 17, 2022 14:14
@hsivonen hsivonen requested a review from echeran March 28, 2022 11:15
provider/testdata/Cargo.toml (outdated, resolved)
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/properties/src/lib.rs is different
  • experimental/char16trie/src/char16trie.rs is different
  • experimental/collator/Cargo.toml is different
  • experimental/collator/src/lib.rs is different
  • experimental/normalizer/Cargo.toml is different
  • experimental/normalizer/src/lib.rs is different
  • provider/datagen/Cargo.toml is now changed in the branch
  • provider/datagen/src/bin/datagen.rs is now changed in the branch
  • provider/datagen/src/lib.rs is now changed in the branch
  • provider/datagen/src/uprops/canonical_decompositions.rs is now changed in the branch
  • provider/datagen/src/uprops/codepointtrie.rs is now changed in the branch
  • provider/datagen/src/uprops/collation_data.rs is now changed in the branch
  • provider/datagen/src/uprops/collation_diacritics.rs is now changed in the branch
  • provider/datagen/src/uprops/collation_jamo.rs is now changed in the branch
  • provider/datagen/src/uprops/collation_metadata.rs is now changed in the branch
  • provider/datagen/src/uprops/collation_serde.rs is now changed in the branch
  • provider/datagen/src/uprops/decompositions_serde.rs is now changed in the branch
  • provider/datagen/src/uprops/enum_codepointtrie.rs is now changed in the branch
  • provider/datagen/src/uprops/mod.rs is now changed in the branch
  • provider/datagen/src/uprops/uprops_helpers.rs is now changed in the branch
  • provider/testdata/Cargo.toml is different
  • provider/testdata/data/testdata.postcard is different
  • provider/uprops/Cargo.toml is no longer changed in the branch
  • provider/uprops/src/canonical_decompositions.rs is no longer changed in the branch
  • provider/uprops/src/codepointtrie.rs is no longer changed in the branch
  • provider/uprops/src/collation_data.rs is no longer changed in the branch
  • provider/uprops/src/collation_diacritics.rs is no longer changed in the branch
  • provider/uprops/src/collation_jamo.rs is no longer changed in the branch
  • provider/uprops/src/collation_metadata.rs is no longer changed in the branch
  • provider/uprops/src/collation_serde.rs is no longer changed in the branch
  • provider/uprops/src/decompositions_serde.rs is no longer changed in the branch
  • provider/uprops/src/enum_codepointtrie.rs is no longer changed in the branch
  • provider/uprops/src/lib.rs is no longer changed in the branch
  • provider/uprops/src/uprops_helpers.rs is no longer changed in the branch
  • provider/uprops/src/uprops_serde.rs is no longer changed in the branch
  • tools/datagen/Cargo.toml is no longer changed in the branch
  • tools/datagen/src/bin/datagen.rs is no longer changed in the branch
  • tools/datagen/src/lib.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@robertbastian robertbastian left a comment

Left some comments on datagen/provider stuff

Also, I cannot comment on this in the file because it's too big:
nit: move experimental/collator/src/CollationTest_*.txt to a testdata directory

provider/datagen/Cargo.toml (outdated, resolved)
provider/datagen/src/bin/datagen.rs (outdated, resolved)
experimental/collator/src/provider.rs (outdated, resolved)
experimental/collator/src/provider.rs (outdated, resolved)
experimental/datatest/Cargo.toml (outdated, resolved)
experimental/normalizer/Cargo.toml (outdated, resolved)
provider/testdata/src/paths.rs (outdated, resolved)
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • components/properties/src/lib.rs is different
  • experimental/char16trie/src/char16trie.rs is different
  • experimental/collator/Cargo.toml is different
  • experimental/collator/README.md is different
  • experimental/collator/src/lib.rs is different
  • experimental/collator/src/provider.rs is different
  • experimental/datatest/Cargo.lock is no longer changed in the branch
  • experimental/datatest/Cargo.toml is no longer changed in the branch
  • experimental/datatest/src/main.rs is no longer changed in the branch
  • experimental/normalizer/Cargo.toml is different
  • experimental/normalizer/src/lib.rs is different
  • experimental/normalizer/src/provider.rs is different
  • provider/datagen/Cargo.toml is different
  • provider/datagen/src/bin/datagen.rs is different
  • provider/datagen/src/collator/mod.rs is different
  • provider/datagen/src/collator/transform.rs is different
  • provider/datagen/src/lib.rs is different
  • provider/datagen/src/uprops/canonical_decompositions.rs is different
  • provider/datagen/src/uprops/enum_codepointtrie.rs is different
  • provider/datagen/src/uprops/mod.rs is different
  • provider/testdata/Cargo.toml is different
  • provider/testdata/data/json/collator/data@1/big5han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/gb2312han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/search/sv.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/standard/sv.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/stroke/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/sv-u-co-search.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/sv-u-co-standard.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/unihan/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/data@1/zh-u-co-big5han.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/zh-u-co-gb2312.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/zh-u-co-stroke.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/zh-u-co-unihan.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/zh-u-co-zhuyin.json is now changed in the branch
  • provider/testdata/data/json/collator/data@1/zhuyin/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/dia@1/traditional/vi.json is no longer changed in the branch
  • provider/testdata/data/json/collator/dia@1/vi-u-co-trad.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/big5han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/gb2312han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/search/sv.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/standard/sv.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/stroke/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/sv-u-co-search.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/sv-u-co-standard.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/unihan/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/meta@1/zh-u-co-big5han.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/zh-u-co-gb2312.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/zh-u-co-stroke.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/zh-u-co-unihan.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/zh-u-co-zhuyin.json is now changed in the branch
  • provider/testdata/data/json/collator/meta@1/zhuyin/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/reord@1/big5han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/reord@1/gb2312han/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/reord@1/stroke/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/reord@1/unihan/zh.json is no longer changed in the branch
  • provider/testdata/data/json/collator/reord@1/zh-u-co-big5han.json is now changed in the branch
  • provider/testdata/data/json/collator/reord@1/zh-u-co-gb2312.json is now changed in the branch
  • provider/testdata/data/json/collator/reord@1/zh-u-co-stroke.json is now changed in the branch
  • provider/testdata/data/json/collator/reord@1/zh-u-co-unihan.json is now changed in the branch
  • provider/testdata/data/json/collator/reord@1/zh-u-co-zhuyin.json is now changed in the branch
  • provider/testdata/data/json/collator/reord@1/zhuyin/zh.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfd@1.json is no longer changed in the branch
  • provider/testdata/data/json/normalizer/nfd@1/und.json is now changed in the branch
  • provider/testdata/data/testdata.postcard is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • provider/datagen/src/collator/transform.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/normalizer/src/lib.rs (outdated, resolved)
experimental/normalizer/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
provider/datagen/src/uprops/mod.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
Contributor

@echeran echeran left a comment

I'm trying to stick to Conventional Comments, but I forgot that all of these non-optional comments need to be accompanied by the proper GitHub review flags being set.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • experimental/collator/src/elements.rs is different
  • provider/datagen/src/collator/transform.rs is different
  • provider/datagen/src/uprops/codepointtrie.rs is no longer changed in the branch
  • provider/datagen/src/uprops/decompositions_serde.rs is different
  • provider/datagen/src/uprops/mod.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@robertbastian robertbastian left a comment

I had written some comments but forgot to send them; they might be outdated.

experimental/collator/fuzz/Cargo.toml (outdated, resolved)
experimental/collator/src/provider.rs (outdated, resolved)
experimental/collator/src/provider.rs (resolved)
experimental/normalizer/Cargo.toml (outdated, resolved)
experimental/normalizer/fuzz/Cargo.toml (outdated, resolved)
experimental/normalizer/src/provider.rs (outdated, resolved)
provider/datagen/src/collator/transform.rs (outdated, resolved)
provider/datagen/src/uprops/canonical_decompositions.rs (outdated, resolved)
provider/testdata/src/paths.rs (outdated, resolved)
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/properties/src/lib.rs is different
  • experimental/collator/Cargo.toml is different
  • experimental/collator/src/comparison.rs is different
  • experimental/normalizer/Cargo.toml is different
  • experimental/normalizer/src/lib.rs is different
  • provider/datagen/Cargo.toml is different
  • provider/datagen/src/bin/datagen.rs is no longer changed in the branch
  • provider/datagen/src/collator/mod.rs is no longer changed in the branch
  • provider/datagen/src/collator/transform.rs is no longer changed in the branch
  • provider/datagen/src/lib.rs is no longer changed in the branch
  • provider/datagen/src/registry.rs is now changed in the branch
  • provider/datagen/src/source.rs is now changed in the branch
  • provider/datagen/src/transform/collator/mod.rs is now changed in the branch
  • provider/datagen/src/transform/collator/transform.rs is now changed in the branch
  • provider/datagen/src/transform/mod.rs is now changed in the branch
  • provider/datagen/src/transform/uprops/canonical_decompositions.rs is now changed in the branch
  • provider/datagen/src/transform/uprops/decompositions_serde.rs is now changed in the branch
  • provider/datagen/src/transform/uprops/mod.rs is now changed in the branch
  • provider/datagen/src/uprops/canonical_decompositions.rs is no longer changed in the branch
  • provider/datagen/src/uprops/decompositions_serde.rs is no longer changed in the branch
  • provider/datagen/src/uprops/enum_codepointtrie.rs is no longer changed in the branch
  • provider/datagen/src/uprops/mod.rs is no longer changed in the branch
  • provider/testdata/Cargo.toml is different
  • provider/testdata/data/testdata.postcard is different
  • provider/testdata/src/bin/datagen.rs is now changed in the branch
  • provider/testdata/src/paths.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • provider/datagen/Cargo.toml is different
  • provider/datagen/src/source.rs is different
  • provider/testdata/data/testdata.postcard is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@robertbastian robertbastian left a comment

LGTM for everything outside experimental; I haven't looked at the actual algorithms in-depth.

experimental/collator/src/elements.rs (outdated, resolved)
experimental/collator/src/lib.rs (outdated, resolved)
robertbastian previously approved these changes May 18, 2022
Member

@robertbastian robertbastian left a comment

Please wait for Elango's review before submitting

Contributor

@echeran echeran left a comment

Looks good. A couple of minor things, but feel free to merge this in now, and create a followup PR later.

experimental/collator/src/options.rs (resolved)
experimental/collator/src/lib.rs (resolved)
experimental/collator/Cargo.toml (resolved)
experimental/collator/src/comparison.rs (resolved)
experimental/collator/src/lib.rs (resolved)
experimental/collator/src/options.rs (resolved)
experimental/collator/src/provider.rs (resolved)
experimental/collator/src/provider.rs (resolved)
experimental/collator/src/provider.rs (resolved)
experimental/normalizer/src/lib.rs (resolved)
experimental/normalizer/src/lib.rs (resolved)
@sffc
Member

sffc commented May 20, 2022

@hsivonen Feel free to hit the merge button, and start opening smaller PRs to resolve remaining open comments. Please make sure all the open comments are associated with an open issue.

Great work!

@hsivonen hsivonen merged commit e40ba55 into unicode-org:main May 20, 2022
@hsivonen
Member Author

Thanks. Merged. I will make sure issues are filed for the comments.

4 participants