Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize the icu_casemap component #3234

Closed
31 tasks done
sffc opened this issue Mar 30, 2023 · 28 comments · Fixed by #3843
Closed
31 tasks done

Stabilize the icu_casemap component #3234

sffc opened this issue Mar 30, 2023 · 28 comments · Fixed by #3843
Assignees
Labels
C-unicode Component: Props, sets, tries help wanted Issue needs an assignee S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Mar 30, 2023

This issue tracks the work to release icu_casemap as a stable component.

Checklist (not exhaustive)

@sffc sffc added help wanted Issue needs an assignee T-core Type: Required functionality C-unicode Component: Props, sets, tries S-large Size: A few weeks (larger feature, major refactoring) labels Mar 30, 2023
@sffc sffc added this to the Priority Backlog ⟨P3⟩ milestone Mar 30, 2023
@Manishearth
Copy link
Member

Added a checklist, please append more items to it

@Manishearth
Copy link
Member

Manishearth commented Jun 15, 2023

[ ] Move data struct validation into deserialize to allow validation-free databake

@robertbastian i'm not convinced we should be doing the heavy validation in serde: it's debug-assertions only, and the struct has GIGO behavior if you give it bad data. I think this is the right call, since there's a lot of work in properly validating this data otherwise.

There are a couple places that are currently relying on validate that I need to GIGO

@robertbastian
Copy link
Member

I don't care, as long as we don't do any validation in databake and I can make the constructor const

@Manishearth
Copy link
Member

There shouldn't be right now, if there is you are welcome to remove it

@Manishearth
Copy link
Member

Useful note for later: the unfold data currently in use in icu4x

[("aʾ", "ẚ"), ("ff", "ff"), ("ffi", "ffi"), ("ffl", "ffl"), ("fi", "fi"), ("fl", "fl"), ("h\u{331}", "ẖ"), ("i\u{307}", "İ"), ("j\u{30c}", "ǰ"), ("ss", "ßẞ"), ("st", "ſtst"), ("t\u{308}", "ẗ"), ("w\u{30a}", "ẘ"), ("y\u{30a}", "ẙ"), ("ʼn", "ʼn"), ("άι", "ᾴ"), ("ήι", "ῄ"), (\u{342}", "ᾶ"), (\u{342}ι", "ᾷ"), ("αι", "ᾳᾼ"), (\u{342}", "ῆ"), (\u{342}ι", "ῇ"), ("ηι", "ῃῌ"), (\u{308}\u{300}", "ῒ"), (\u{308}\u{301}", "ΐΐ"), (\u{308}\u{342}", "ῗ"), (\u{342}", "ῖ"), (\u{313}", "ῤ"), (\u{308}\u{300}", "ῢ"), (\u{308}\u{301}", "ΰΰ"), (\u{308}\u{342}", "ῧ"), (\u{313}", "ὐ"), (\u{313}\u{300}", "ὒ"), (\u{313}\u{301}", "ὔ"), (\u{313}\u{342}", "ὖ"), (\u{342}", "ῦ"), (\u{342}", "ῶ"), (\u{342}ι", "ῷ"), ("ωι", "ῳῼ"), ("ώι", "ῴ"), ("եւ", "և"), ("մե", "ﬔ"), ("մի", "ﬕ"), ("մխ", "ﬗ"), ("մն", "ﬓ"), ("վն", "ﬖ"), ("ἀι", "ᾀᾈ"), ("ἁι", "ᾁᾉ"), ("ἂι", "ᾂᾊ"), ("ἃι", "ᾃᾋ"), ("ἄι", "ᾄᾌ"), ("ἅι", "ᾅᾍ"), ("ἆι", "ᾆᾎ"), ("ἇι", "ᾇᾏ"), ("ἠι", "ᾐᾘ"), ("ἡι", "ᾑᾙ"), ("ἢι", "ᾒᾚ"), ("ἣι", "ᾓᾛ"), ("ἤι", "ᾔᾜ"), ("ἥι", "ᾕᾝ"), ("ἦι", "ᾖᾞ"), ("ἧι", "ᾗᾟ"), ("ὠι", "ᾠᾨ"), ("ὡι", "ᾡᾩ"), ("ὢι", "ᾢᾪ"), ("ὣι", "ᾣᾫ"), ("ὤι", "ᾤᾬ"), ("ὥι", "ᾥᾭ"), ("ὦι", "ᾦᾮ"), ("ὧι", "ᾧᾯ"), ("ὰι", "ᾲ"), ("ὴι", "ῂ"), ("ὼι", "ῲ")]

@Manishearth
Copy link
Member

Should #3552 be a stabilization blocker? Feels like we can count it as a known bug.

@Manishearth
Copy link
Member

Manishearth commented Jun 21, 2023

Also, ICU4C supports str.toTitle(locale, options). Should we as well? Currently we do not support options, we can add them as an API later (to_upper_with_options() or something)

Also str.caseCompare(str2), whcih also takes options

@Manishearth
Copy link
Member

The toTitle() functions need a break iterator: we need to add APIs which:

  • Instantiate a default word break iterator
  • Allow passing in a word break iterator

I also think this can be designed in a later release.

@Manishearth
Copy link
Member

question (@sffc , @robertbastian ): Do the current function names look good to you?

  • to_uppercase(char) -> char
  • to_full_uppercase(&str) -> Writeable
  • to_full_uppercase_string(&str) -> String

I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

@Manishearth
Copy link
Member

Manishearth commented Jun 21, 2023

A change we should make is move the locale from the constructor to the methods. We can make them all take CaseMappingLocale and have people call .into() or Default::default(). We could also give them _with_locale() versions but that might lead to a large proliferation.

And how should it look over FFI (where locales are not free to instantiate). I guess we can expose that enum.

Thoughts?

@eggrobin
Copy link
Member

Do the current function names look good to you?

to_uppercase(char) -> char
to_full_uppercase(&str) -> Writeable
to_full_uppercase_string(&str) -> String
I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

The simple mappings and foldings (which you should practically never use unless you have specific compatibility requirements) having shorter and more default-looking names than the default ones seems like a bad idea. I would favour renaming all of the char-to-char functions to have simple in the name, this is a well-established term in this context.

A change we should make is move the locale from the constructor to the methods.

Something like that would be a good idea. It is very weird that the case foldings look like they depend on the locale, which they do not and must not.

@sffc
Copy link
Member Author

sffc commented Jun 21, 2023

Maybe to_uppercase(&str) -> Writeable, to_uppercase_string(&str) -> String, to_uppercase_char(char) -> char ?

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

@robertbastian
Copy link
Member

We might actually want to accept W: Writeable like list formatting does already.

@Manishearth
Copy link
Member

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

It's what's already used internally, it's a simple enum:

pub enum CaseMapLocale {
    Root,
    Turkish,
    Lithuanian,
    Greek,
    Dutch,
    Armenian,
}

and it basically covers the different casemapping special-case modes. ICU4C's API consumes a Locale; but it does feel potentially faster to not require a conversion each time, and exposing something that is From<Locale> seems fine.

This way we only require the actual subpart of the locale the algorithm cares about, instead of having clients treat it opaquely.

@Manishearth
Copy link
Member

Copying over Markus' feedback from the ICU4X team meeting notes

  • What's the difference between try_new and try_new_with_locale?
  • Give examples for all the functions, especially to_titlecase(char)->char (dz -> Dz)
  • Do we support specialcasing.txt?
  • There are some things that are only in ICU and not in data. Example: Greek uppercasing, which drops most but not all accents, but with certain exceptions. It's not representable in data, so it is coded manually.
  • Long term we want the Edits API. Clients need it like Chrome.
  • People will want full titlecase, but it requires segmentation. Maybe make it pluggable with a trait. A default implementation could be to titlecase only the first character. When titlecasing, sometimes you want to leave interior characters along, and sometimes you want to convert them to lowercase.
  • Take a look at the ICU4J CaseMap API!
  • In ICU, the locale-special data are stored in the same trie as root data, so we can specify the locale later.
  • For case folding, the locale doesn't matter, only turkic and non-turkic. Right now it's confusing because it looks like the locale could influence case folding.

I do think I'm going to go along the path of doing titlecase with a segmentation trait.

@sffc
Copy link
Member Author

sffc commented Jun 21, 2023

If I had to choose between full-string titlecase and Greek uppercase, I think Greek uppercase is more important since it is about i18n correctness.

The Locale thing reminds me a lot of what happens with Collation and Segmentation tailorings.

@Manishearth
Copy link
Member

Manishearth commented Jun 21, 2023

I've copied the actionable bits of Markus' feedback, as well as stuff discussed here, into the issue above.

I'm not sure if it's either-or: full-string titlecase isn't that tricky, whereas Greek uppercase seems to involve reimplementing half of the uppercasing algorithm, without any spec for reference. It's a lot more work.

@Manishearth
Copy link
Member

Discussions to have:

  • General API shape (Rust and FFI) for locale specific stuff
  • Greek uppercasing thing
  • Titlecasing design

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 26, 2023
@sffc
Copy link
Member Author

sffc commented Jun 27, 2023

Discuss with:

@Manishearth
Copy link
Member

Discussion:

CaseMapLocale:

  • @robertbastian Probably should take a locale on the methods (as opposed to a custom enum)
  • @sffc Agree, I consider it a language hint
  • @eggrobin Note that folding should not be treated as locale-dependent, "turkic casefolding" should be considered to be a different operation
  • @eggrobin pointing out that the spec has four casefolding operations: toCaseFold, NFD ∘ toCaseFold ∘ NFD, NFKD ∘ toCaseFold ∘ NFD, toNFKC_Casefold ∘ NFD
  • Should we accept Language or Locale
  • @Manishearth: note that in the future someone may decide they want de-u-cm-ß
  • If we do an enum (or opaque struct wrapping an enum) it shuld be called CaseMapLanguageHint
  • Decision: LanguageIdentifier, Locale over FFI
  • Decision: No to_uppercase_root()

Titlecasing:

  • Don't expose full segmentation based API. Expose titlecase_segment(&str, &langid) -> String / Writeable that does not handle any segmentation
  • Add more if someone asks for it

Simple case mapping:

  • Methods called to_simple_lowercase(char) -> char.

Greek uppercasing:

  • @sffc Should not treat ICU4C parity as a goal
  • @Manishearth and @eggrobin: "i18n correct" is not feasible without tons of research
  • @sffc iff this is going to be coming to CLDR/some other body, we should wait for it. If not, we should block our release.
  • @Manishearth and @robertbastian do not disagree
  • @eggrobin We could take the position that we do best-effort ICU4C parity
  • Decision: Defer until PAG meeting, based on response decide to block release or not. Potentially implement anyway.

Agreed: @Manishearth @sffc @robertbastian @eggrobin

@Manishearth
Copy link
Member

Manishearth commented Jul 15, 2023

Graduation checklist

  • The crate should fully conform with the ICU4X Style Guide:
    • The crate should have a complete library header as shown in boilerplate.md with Clippy passing
    • The names of exported types should conform to the recommendations in the Style Guide
    • All Error types should be Copy
    • All constructors should take options structs by value
    • All constructors should take arguments in the following order: Provider, Locale, Options
    • All constructors should have the standard set of overloads for provider types
    • Runtime dependencies should be minimal and able to be disabled with Cargo features if possible
    • Any TODO, FIXME, todo!, unimplemented!, or other placeholders should either be resolved or link to an issue number. (It is okay to ship a small amount of code with tech debt comments, but anything having to do with code correctness should be resolved)
  • The crate should have a conventional Cargo.toml file:
    • Cargo.toml should use license, not license-file
    • The description should be useful
    • Correct repo link, authors, categories, include
    • Correct docs.rs and cargo-all-features metadata settings
    • A rust-version field with the MSRV of this crate in accordance with the current ICU4X policies on MSRV.
    • The crate should have an std feature if (and only if) it contains code that depends on std, such as file I/O or implementing the Error trait
    • Audit all features so that any foo/bar in a feature is foo?/bar when foo is an optional dep; if you intend to enable foo, add two entries
    • Use dep: for enabling dependencies
  • The crate should be fully documented
    • Every exported function should have docs coverage
    • There should be a crate-level example that illustrates a common use case for the component with the heading # Examples
    • All options and conditional code paths should have a corresponding docs test with the heading # Examples
    • All functions that are conditional on a Cargo feature should say so (last line before # Examples): ✨ *Enabled with the `alloc` Cargo feature.*
    • Compiled data constructors should say "with compiled data" in the first sentence and should have a Cargo feature alert following the above syntax.
  • The data structs should fully follow ZeroVec style
    • Deserialization should not have a "net violation" in the verify-zero-copy test
    • Constructors should avoid allocating memory in the common case
    • Opaque blobs of data should be avoided if possible (instead use VarZeroVec, ZeroMap, etc.)
    • Data structs should not be panicky to load/deserialize and conform to data_safety.md
  • The component should be fully integrated with ICU4X tooling
    • There should be an overview Criterion benchmark
    • There should be individual Criterion benchmarks for interesting or performance critical code paths
    • There should be at least one example plumbed with the icu_benchmark_macros
    • The component should be fully covered by FFI with no unjustified suppressions nor entries in missing_apis.txt
  • The component should encourage i18n best practices
    • The component should produce correct results in all locales as generated by icu4x-datagen
    • Where applicable, the component should be consistent with ECMA-402 and UTS Initialize the minimal set of components. #35
    • Any gaps in i18n quality should be fixed, or, if that is not possible, they should have tracking issues and a concrete, resourced path forward. The intent is to not ship components with known i18n correctness problems and no plan to fix them in an upcoming release
      • One known gap is irish casefolding, there is an upstream issue to figure that out CLDR-13985
    • The API design should receive sign-off from a non-ICU4X i18n expert such as Markus Scherer

@Manishearth
Copy link
Member

Manishearth commented Jul 15, 2023

A bunch of the ticked-off items above are fixed in #3689

@sffc I think there are a couple entries in the checklist that may benefit from a 15 minute pair discussion where we verify it together. Specifically: the style guide naming part, i18n correctness, and data struct design

@Manishearth
Copy link
Member

Only remaining stabilization blocker is Shane (or someone else) and I should go through everything.

(and then move the folder)

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jul 20, 2023
@Manishearth
Copy link
Member

Manishearth commented Jul 21, 2023

  • debug_assert in Exceptions datagen when doing GIGO

@sffc
Copy link
Member Author

sffc commented Aug 8, 2023

I added the new checkboxes from #3693 to the comment above.

A few things I notice:

  • You have # Example instead of # Examples (I know, minor silly thing that I don't necessarily agree with, but this is the time to fix it)
  • I don't see examples for TailCasing or HeadAdjustment; it looks like you cover them in the docs for TitlecaseMapper itself, which is fine, but please add a link.
  • This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"

@Manishearth
Copy link
Member

This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"

Yeah I was planning to do that later. But I'll just roll it into #3803

@Manishearth
Copy link
Member

I think I have handled every box on this page except for #3801 in #3803

(even if not, I would prefer to land that rather than require that PR fix everything)

@Manishearth Manishearth linked a pull request Aug 11, 2023 that will close this issue
@Manishearth
Copy link
Member

Final checkbox checked by #3843

We're done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries help wanted Issue needs an assignee S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants