Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider splitting locale-fallbacking data out of icu_properties #5120

Closed
hsivonen opened this issue Jun 25, 2024 · 20 comments
Closed

Consider splitting locale-fallbacking data out of icu_properties #5120

hsivonen opened this issue Jun 25, 2024 · 20 comments
Assignees
Labels
C-unicode Component: Props, sets, tries discuss-priority Discuss at the next ICU4X meeting

Comments

@hsivonen
Copy link
Member

servo/rust-url#939 shows the build of icu_properties blocking on icu_locid_transform getting compiled. I gather the exemplar_chars module is the culprit.

We should consider splitting the exemplar_chars module, which is locale-dependent CLDR data and not locale-independent Unicode Database data into a different crate.

@hsivonen hsivonen added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries labels Jun 25, 2024
@hsivonen hsivonen added this to the ICU4X 2.0 ⟨P1⟩ milestone Jun 25, 2024
@sffc
Copy link
Member

sffc commented Jun 25, 2024

I agree there's room for improvement here. If icu_properties is all singleton keys, we could likely drop the icu_locid_transform dependency (note: crates are being reorganized in 2.0, but we wouldn't be able to entirely avoid the LocaleFallbacker dependency so long as we have exemplar chars, I think.

@Manishearth
Copy link
Member

  • @sffc - If we didn't have exemplar_chars, this is already done, right? baked data knows when it needs fallback data.
    • Yes, and there might be ways to build custom baked data and also avoid the locale fallback parameter.
  • @hsivonen Custom baked data doesn't address the idna case since that uses crates.io baked data.
    • Yeah, if crates.io is desired, that doesn't work
  • @robertbastian could just put behind a feature flag
  • @hsivonen - As a matter of principle, why do we have CLDR stuff in icu_properties? It seems like it's a whole different thing. By what logic does exemplar chars end up in properties?
  • @Manishearth We're getting icu_locale as a high level thing yes? We should boot this into there (and maybe feature gate it there)? Never made sense to me as a part of icu_properties
  • @sffc - I think they went into icu_properties for lack of a better place. But now they can go into the icu_locale data-driven crate.
    • We could be proactive and, for crates that are grab-bags like locale, we could feature-gate everything. So exemplars goes into icu_locale with an on-by-default feature. Can be kicked out into a separate crate if people need.
  • @hsivonen - disagree that the use case of no-data locales is small. There's a lot you can do with locales without display names or exemplars.
  • @sffc - icu_locale_core exists for people who need smaller compile times and don't need exemplars and stuff
  • @Manishearth - either way, icu_locale has metacrate vibes to me, being proactive about features is good and helps avoid breaking --no-default-features users
  • @sffc note that we can't feature gate baked data
  • @Manishearth true, but we can feature gate the macro invocations and that's the bulk of compile time costs
  • @hsivonen - It looks like the _data crates are completely insignificant for compile times.
  • @echeran - I agree with the characterization of exemplars not having a good place to go. Does it really fit in icu_locale, though?
    • It's a question you ask a locale: "what are your exemplar chars?" It would also be valid for it to live completely separately.
  • @sffc - It's similar to locale directionality.
  • @hsivonen - It's like talking about, what's your first day of week, what's your temperature unit. But I think displaynames should go in a separate crate.
    • I'd rather end up with too few crates now rather than splitting without people asking.
    • Agree. The main cost of splitting in the future is breaking no-default-features users, but we can mitigate that with proactive default features. And it's okay to break no-default-features users.
  • @hsivonen - I observe Chrome, Safari, and Edge all patch display name data. But there's less a need to patch exemplars.

Concrete proposal:

  • exemplar_chars lives in icu_locale
  • feature gated, on by default
  • could be split into its own crate in the future but not now
  • No decision at this time on displaynames location

LGTM: @sffc @hsivonen

@robertbastian robertbastian added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Jul 23, 2024
@sffc
Copy link
Member

sffc commented Aug 8, 2024

@robertbastian Do you agree with the above conclusion?

@sffc sffc added needs-approval One or more stakeholders need to approve proposal and removed discuss-priority Discuss at the next ICU4X meeting labels Aug 8, 2024
@sffc sffc added this to icu4x 2.0 Aug 8, 2024
@sffc sffc moved this to Needs discussion to unblock in icu4x 2.0 Aug 8, 2024
@robertbastian
Copy link
Member

I don't see the value of feature-gating them. We're not feature-gating existing icu_locale functionality.

@sffc
Copy link
Member

sffc commented Aug 9, 2024

Does the following unattributed comment (I think from @Manishearth?) address your concern?

The main cost of splitting [the exemplar chars functionality into a child crate] in the future is breaking no-default-features users, but we can mitigate that with proactive default features. And it's okay to break no-default-features users.

@robertbastian
Copy link
Member

I'm asking what the cost of not splitting exemplar chars into a child crate is.

@sffc
Copy link
Member

sffc commented Aug 9, 2024

I don't think that's a question that can be answered at the moment since this is about future-proofing. Maybe exemplar chars data decides to add a lot more details, like rules to determine phonemes, what is a vowel, legal syllables, etc, and maybe that data gets big enough that we don't want it unconditionally in icu_locale.

@sffc
Copy link
Member

sffc commented Aug 15, 2024

@robertbastian confirms consensus on the first bullet point: "exemplar_chars lives in icu_locale" but not bullet points 2-3.

@sffc
Copy link
Member

sffc commented Aug 15, 2024

This is still in the 2.0 critical path because we are not in alignment about whether adding it as a default feature in the future is a breaking change.

@robertbastian
Copy link
Member

It's a bit annoying to put something that needs fallback into the icu_locale crate, as that leads to a circular dependency with the data crate (which reexports icu_locale) if the data needs it. We'd have to special case it.

@sffc
Copy link
Member

sffc commented Aug 20, 2024

Is that issue resolved if we move forward with the previously proposed icu_locale_fallback crate?

@robertbastian
Copy link
Member

It's not an issue like a circular dependency, it's just that this one crate would need to access itself in order to obtain fallback information. It just feels cleaner if icu_locale only had singletons, but the same goes for icu_properties and the exemplar chars have to live somewhere.

icu_locale would reexport icu_locale_fallback, so splitting it out doesn't change anything.

@sffc
Copy link
Member

sffc commented Aug 20, 2024

The issue was in the data crates. The data crates can use icu_locale_fallback instead of icu_locale, avoiding the circular dependency. Right?

@sffc
Copy link
Member

sffc commented Aug 20, 2024

My understanding of the issue you raised was icu_locale -> icu_locale_data -> icu_locale, which we can change to icu_locale -> icu_locale_data -> icu_locale_fallback

@robertbastian
Copy link
Member

We can also change it to , icu_locale_data is only usable if icu_locale is available, and the data crates don't reexport the crate they're providing data for, as that's circular. As I said, there's no circular dependency issue, it's just that we need to special-case the crate in our script that generates data crates.

@sffc sffc added discuss-priority Discuss at the next ICU4X meeting and removed needs-approval One or more stakeholders need to approve proposal labels Aug 20, 2024
@sffc
Copy link
Member

sffc commented Aug 20, 2024

About the exemplar chars feature: I discussed it with @robertbastian and I'm convinced by this argument: "we don't feature-gate classes in other crates that we don't plan to split out into a child crate, so we shouldn't do it in icu_locale".

Our design for icu_locale has landed in a funny area where it is both a component and a metacrate: it is a component because it defines its own types (like canonicalizer), and it is a metacrate because it includes things from child crates (like Locale and eventually Fallbacker). I had been working on the not-fully-verbalized presumption that icu_locale would have features for all of its child crates so that functionality of icu_locale can be accessed without including all of the child components, except icu_locale_core which is core functionality that every class in icu_locale requires. For example, Canonicalizer and LocaleDirectionality do not need fallback, so icu_locale_fallback should be feature-gated. Likewise, a possible future icu_locale_displaynames should be feature-gated.

Why do we care about letting clients disable components they don't need?

  1. Being modular is the first principle of ICU4X.
  2. Crates are somewhat costly to clients who vendor, because each individual crate triggers review processes, so we should allow clients to select which crates they actually need for their use case.
  3. We have concerns from some clients about compile times, and crates are a clean, natural place to draw the line for reducing compile times. We do not currently have component-specific features for the purposes of reducing compile times and I'm not proposing we do so at this time, but we should do it for crates.

Now I can see an argument that icu_locale_fallback falls into the same bucket as icu_locale_core in terms of being widely needed by all icu_locale functionality. I agree that icu_locale_fallback is needed by probably 90% of clients of icu_locale. However, I can identify legitimate use cases where it is not needed:

  1. People who need exemplar chars but always query with base languages. This might be the case in text layout engines spell checkers, for example.
  2. People who need only LocaleDirectionality. Could be useful in layout engines.
  3. People who need LocaleCanonicalizer but use something other than ICU4X for their data loading. Might happen if people are using Fluent, for example.

Since these use cases exist, icu_locale_fallback is optional, and optional dependencies should be feature-gated. This is a general-purpose yardstick that we have applied for many years.

Note: the feature icu_locale/compiled_data should enable icu_locale/dep:icu_locale_fallback without the fallback APIs being re-exported. The feature icu_locale/fallback should enable the dependency and also the APIs.

Another potential area of inconsistency is that we don't always feature-gate components that are optional in other components. This is a good call-out that we might want to be more consistent about. For example, icu_datetime should perhaps gain a timezone feature that disables icu_timezone. I would consider that a bug that we want to fix.

@robertbastian Did I miss anything?

@robertbastian
Copy link
Member

I'm not convinced by any of those use cases. Text layout engines don't need exemplar chars or locale directionality, they use bidi properties.

I still don't see how having these features is not going to be messy, but if I'm overruled I'm overruled.

@sffc
Copy link
Member

sffc commented Aug 20, 2024

You're right. The one user we know we have for exemplar chars uses it for spell checking. I corrected my list to say "spell checkers" instead of "layout engines". I also changed LocaleDirectionality to say "layout engines" instead of "text layout engines" because you can use LocaleDirectionality to decide between a RTL or LTR layout of your app.

Do you agree that these are legitimate use cases that amount to something on the order of 10% or more of the users of icu_locale? Or is your position still that icu_locale_fallback is core to icu_locale?

Separate from icu_locale_fallback, we may have icu_locale_displaynames, which is definitely not needed for all users of icu_locale and which definitely contributes to compile times (I can produce evidence for those claims upon request). Are you in alignment that icu_locale_displaynames is feature-gated in icu_locale? If not, what other solution would you like to propose for people who need icu_locale without display names?

@sffc
Copy link
Member

sffc commented Aug 21, 2024

I still don't see how having these features is not going to be messy, but if I'm overruled I'm overruled.

I prefer to achieve consensus and mutual understanding of values. This isn't about politics; it's about weighing the tradeoffs. I would like to understand your perspective better on features being "messy" and how we should weight that cost relative to the benefits I've laid out.

@robertbastian
Copy link
Member

Current state of the repo: icu_locale exists but has no Cargo features.

  • @hsivonen My main interest in having the features was making sure stuff didn't get compiled if it wasn't needed. If the locale fallbacking stuff is out of the way of icu_properties, I'm okay.
  • @robertbastian We're not going to add features for everything, for every piece of functionality. It's hard to predict. So let's just wait until they tell us.
  • @sffc I would love if we just say that our stability policy is that we can break no-default-features.
  • @Manishearth We should be. The ecosystem consensus is that this is non-breaking
  • @robertbastian We should be careful when we do it.

Proposal:

  • Cargo features should only be introduced if they gate dependencies, not only code
  • If a client needs it, we can carefully split crates later, introducing default features, even if it breaks no-default-features

LGTM: @robertbastian @sffc @Manishearth

@github-project-automation github-project-automation bot moved this from Needs discussion to unblock to Done in icu4x 2.0 Sep 16, 2024
@robertbastian robertbastian self-assigned this Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries discuss-priority Discuss at the next ICU4X meeting
Projects
Status: Done
Development

No branches or pull requests

4 participants