Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move CanonicalCombiningClass type out of icu_properties #5121

Open
hsivonen opened this issue Jun 25, 2024 · 10 comments · May be fixed by #5551
Open

Move CanonicalCombiningClass type out of icu_properties #5121

hsivonen opened this issue Jun 25, 2024 · 10 comments · May be fixed by #5551
Assignees
Labels
C-collator Component: Collation, normalization C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement)

Comments

@hsivonen
Copy link
Member

The data for CanonicalCombiningClass is available both via icu_properties and via icu_normalizer. To have the same type in both cases, icu_normalizer depends on icu_properties. This prevents the two from being compiled in parallel. See servo/rust-url#939

We should break the dependency. The obvious thing to do is to move the type to icu_normalizer and not make the data available via icu_properties at all. If we really wanted to have two ways of providing the data, the type should go into a third crate, but there isn't a candidate destination that wouldn't be weird.

@hsivonen hsivonen added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries C-collator Component: Collation, normalization labels Jun 25, 2024
@hsivonen hsivonen added this to the ICU4X 2.0 ⟨P1⟩ milestone Jun 25, 2024
@sffc
Copy link
Member

sffc commented Jun 25, 2024

Hmm, optimizing for crates compiling in parallel isn't something I think we've looked at closely.

@hsivonen
Copy link
Member Author

...and this is harder than it first appears, since we support named access of property data for the benefit of regular expression engines, and canonical combining class is one of the properties for which access by name is supported.

I think it's bad to have the data in two places, but having the data only normalizer and having named access in properties would break build parallelism due to an opposite dependency. This implies putting named access in a separate crate.

The current build parallelism issue could be addressed by moving the type to a third crate while keeping two instances of the data. Kinda weird to have a one-off crate just for CanonicalCombiningClass. Potentially excessive to have a separate crate for all the property types.

@hsivonen
Copy link
Member Author

hsivonen commented Jul 1, 2024

(Not sure if the meeting minutes on this issue ended up on GitHub somewhere.)

It occurred to me that the approach discussed in the meeting of making the default features of icu_normalizer depend on icu_properties but making it possible for idna and icu_harfbuzz to opt out the default features to use a u8-returning API isn't that great as a built parallelism solution: The moment the application includes a dependency on icu_normalizer without opting out of default features, icu_normalizer would stop building in parallel with icu_properties even if the added dependency on icu_normalizer didn't care about the canonical combining class API at all (after all, the motivation for that API existing is icu_harfbuzz and the other currently known user is idna 1.0).

To not make it too easy to break build parallelism, it seems like a typed getter for canonical combining class should be a non-default feature (which would go against the current ICU4X design approach of not having to enable features for getting API surface).

@Manishearth
Copy link
Member

June 27

  • @hsivonen - We have these small struct wrappers in icu_properties. If we moved those to separate smaller crate we could parallelize builds better.
    • Is it only the struct?
  • @hsivonen - The struct and the functions on it.
    • Put it in icu_provider, with doc hidden? We do this for some locale fallback stuff.
    • I don't think we need to go that far. The reason is CanonicalCombiningClassMap; it's not even used in the normalizer, an extra thing for harfbuzz.
  • @hsivonen - Yes, and idna. The reason the CCC data is also in the normalizer is that it lives in the same trie as the normalization stuff.
    • But the CCCMap is not used by idna
  • @hsivonen - I think it is? Check the latest version. If you have a ZWJ, that's allowed if the CCC of the previous character is a virama.
    • I don't see it.
    • The only two APIs where this matters are the actual APIs that get you values of type CCC. We could, like, we have a get_u8 function and it gives you a u8 and then we have feature-flagged get_CCC that return the fancier types. The icu_properties CCC has name mapper, named constants, etc.
  • @sffc - We could have CCC that lives in both crates, with conversion between them. But I like the get_u8 approach. I don't think we should pollute icu_provider.
  • @hsivonen - I think what is a nicer user experience
    • Tradeoff is all the methods in CCC would no longer be methods in CCC. It has name getter methods. That's half the benefit of a standalone type.
  • @sffc - Open to the idea of an icu core crate (doesn't do anything). Quite a number of things to shove in there, including CCC. We've so far avoided it, it's probably best to continue to avoid it. I don't think there's any harm in a get_u8 function
  • @hsivonen - Do they use the constants?
    • I think collator but not normalizer.
  • We can have a test that asserts constants are the same.
  • @sffc - We want to keep the problem internal to our crates
    • I think there's a valid reason to split the properties crate in two features. This ties into the next topic. We can only reduce syn dependencies from certain things, like map and set getters. Also zerovec. This is where it starts getting tricky. If we want to make icu_properties free of zerovec-derive, do we need to add more gnarly things? That would be feature flagging, which could include maps and sets by default (except maybe compile times of baked data?), and then other things can be added in. It's potentially worth investigating. Just pointing out this has valid use cases outside of CCC.

Consensus:

  • If desired, we can add get_u8 and get32_u8 to the CCCMap in icu_normalizer, allowing us to feature-gate icu_properties. Other changes, such as reducing constants, are internal.
  • It would be a default feature, so it can be done post 2.0.
  • Can document choices on the APIs

LGTM: @sffc @younies (@hsivonen)

@sffc
Copy link
Member

sffc commented Jul 2, 2024

I'm fine making it be a non-default feature.

@Manishearth
Copy link
Member

Me too

@Manishearth
Copy link
Member

Though in that case we need to make the change now

@robertbastian robertbastian added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Jul 23, 2024
@sffc
Copy link
Member

sffc commented Aug 8, 2024

@robertbastian Do you agree with the above conclusion?

@sffc sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss-priority Discuss at the next ICU4X meeting labels Aug 8, 2024
@hsivonen
Copy link
Member Author

hsivonen commented Aug 8, 2024

Note: When doing this, we should test the effect on the special cases in the Pernosco debugger. If this breaks them, we should make the feature gate for icu_properties also restore the internal types to match what Pernosco expects so that to debug, you'd opt into to icu_properties without having to ask for Pernosco to be adjusted (again) for ICU4X debugging.

@robertbastian
Copy link
Member

sgtm

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) and removed needs-approval One or more stakeholders need to approve proposal labels Aug 9, 2024
@Manishearth Manishearth linked a pull request Sep 17, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-collator Component: Collation, normalization C-unicode Component: Props, sets, tries S-small Size: One afternoon (small bug fix or enhancement)
Projects
Status: Unclaimed for sprint
Development

Successfully merging a pull request may close this issue.

4 participants