Create stabilization_checklist.md #3561

Closed

@sffc sffc commented Jun 22, 2023

Fixes #3560

@sffc sffc requested a review from a team as a code owner June 22, 2023 04:56

@robertbastian robertbastian left a comment


@Manishearth Manishearth left a comment

yeah we should also copy in all of the #3397 bullet points and organize appropriately

2. [ ] **FFI is complete.** There should be no unjustified suppressions nor entries in missing_apis.txt.
3. [ ] **Docs are complete.** In addition to covering every exported function, there needs to be a crate-level overview example, and all options should be covered by at least one docs test.
4. [ ] **Data story is complete.** The code and data should be highly modular, such that users do not need to carry more code or data than they need. The data should be zero-copy of course, and it should make use of the abstractions available in the zerovec crate. If datagen requires a new feature in order for the new data to be modularized, that feature should be implemented.
5. [ ] **Feature exhibits i18n correctness.** There should be no known gaps in localization quality, meaning that for the component in question, a user in any CLDR locale should receive an experience on par with that of a user in English. In other words, the component does not need to be feature-complete, but of the features that are supported, they should be fully implemented.

@Manishearth Manishearth Jun 22, 2023

Hm, I think this is rather strict and potentially ill-defined. For example, as Markus has mentioned in the context of #3552, that was not a feature ICU4C had for a long time: it's a relatively niche thing that required a bunch of work to design and implement, and was done after the fact. My impression is that it was not an issue that there were clear opinions on amongst the user community (i.e. it was not a glaring hole).

In general I think ICU4X is going to have bugs and gaps. We have some known ones in segmentation as well because we implement spec rules rather than CLDR ones. I would consider that a gap in localization quality as well: the CLDR rules are known to be better than the spec (and the spec shall soon update to them). But segmentation is stable.

And this "gap" in segmentation is also definitely something which is niche enough to not have a clear opinion from the user community; it's not a glaring hole. Indic users will not be surprised by the lack of akshara-handling segmentation; they will prefer akshara segmentation, but it's not a huge deal.

Whether or not a gap of this kind is a stabilization blocker, in this policy, largely depends on whether or not it's discovered before stabilization. If I had not tried the niche Greek ICU4C tests I would not have noticed this until later, since I did not have plans to import ICU4C's entire testsuite by hand.

I think a better rule is that each such feature gap is evaluated on a per-case basis, and, importantly, we ensure we are not blocked out of implementing it in the future. (Ideally, we document the gap as well.) In general "not a glaring hole" is an okay litmus test for me here.

@sffc sffc Jun 22, 2023

The type of situation I would like to avoid is one where half-baked (or even 80%-baked) components spawn bugs that sit around for years with no one motivated to fix them. The stabilization of a component is the only tool that ICU4X The Open Source Project holds to put real pressure on getting things fixed. After something ships, follow-up bugs just go in the pile of backlog issues and have a high likelihood of never getting fixed.

I'm not as happy as I would like to be about the state of the Segmenter component, as you know, but business pressure added up quarter after quarter and we decided to ship the best that the team could achieve. Also, with Segmenter, we had (and still have) resourcing established at the time of release in order to follow up on the remaining issues. I really don't want to use the state of Segmenter as a precedent for what happens in other components.

@Manishearth Manishearth

I think Segmenter is an interesting example even absent the specific situation around ICU4X: because the rules I am talking about took ages to bring to consensus. If we had stabilized segmenter a couple years ago we would have legitimately been able to claim i18n-completeness, and in 2022/2023 an i18n gap would have magically appeared as the goalposts had moved.

Because that's the thing about i18n completeness: it is a constantly moving goalpost. A lot of these things do not have One True Answer, but rather after careful research and design work people come up with a solution that balances the tradeoffs effectively. The akshara segmentation rules have gone into drafts of the spec, been taken out because people disagreed, been put in CLDR as an experiment, and are now on their way back into the spec. The previous definition of grapheme cluster was still perfectly workable from an i18n standpoint; it just did not support a specific use case people wished to use it for, but could very legitimately be claimed as being out of scope for what grapheme clusters are for.

A lot of what Unicode algorithm work does is to try really really hard to take a concept (like casing, or "words") and figure out a way that it can be stretched to fit across languages and locales. It is a Sisyphean task (but so is the rest of Unicode). Quite often it's the case that there is no equivalent concept on the other end, or there are multiple such concepts that mean slightly different things. The akshara situation is a case of figuring out which one of these concepts to pick: do you pick orthographic syllables to be your grapheme clusters, and if so, how do you handle conjuncts? The Greek uppercasing thing is quite similar: it does appear to be the case that accent marks are found in some full-uppercased contexts, so there's nuance as to which one to pick as the default. ICU4C eventually picked the more common one, which is great, but I think it's hard to call this kind of selection "i18n-completeness" because there is not a single correct answer. In fact, the reason Greek tends to do this is apparently that a lack of uppercase accents in the early days of computing formed habits; one could easily argue that it is more i18n-complete to undo this wrong.

Specifics aside, this kind of thing is going to happen regardless of whether such nuances are settled on before or after stabilization: if stabilization is our only lever for getting these fixed, that obscures a potentially larger problem where such things that crop up post-stabilization will never get fixed. And they will crop up as Unicode continues to listen to users and form measured, well-informed opinions as to which approach appropriately balances the tradeoffs between different users (or use-cases).

> The stabilization of a component is the only tool that ICU4X The Open Source Project holds to put real pressure on getting things fixed

I actually don't agree on this: I think client needs are a pretty strong force here as well. I think we would be very careful selling something without ICU4C parity when it comes to i18n-completeness to a production user: for example, we cannot recommend icu_casemap to browser authors as long as the Greek thing isn't fixed. But I don't think that ought to block stabilization: there are plenty of clients for whom this may not matter at all (e.g. it does not matter for Flutter).

(this might be worth discussing in zrh next week)


@sffc sffc Jun 22, 2023

> I actually don't agree on this: I think client needs are a pretty strong force here as well. I think we would be very careful selling something without ICU4C parity when it comes to i18n-completeness to a production user: for example, we cannot recommend icu_casemap to browser authors as long as the Greek thing isn't fixed. But I don't think that ought to block stabilization: there are plenty of clients for whom this may not matter at all (e.g. it does not matter for Flutter).

So in other words, the other lever is parity with ICU4C. It is rare for a client to put any amount of resourcing behind fixing an i18n correctness issue for the sake of i18n correctness.

@Manishearth Manishearth

My assertion here is that when it comes to a lot of this stuff, "parity with ICU4C" is not materially different from "i18n correct", because there is not One True Answer for many of these things and instead someone has to set an industry standard as to what these operations mean in a pan-linguistic way. Sometimes that is the Unicode standard, sometimes it is ICU4C. Hopefully we may be the ones blazing those frontiers as well in the future as we start picking up things not covered by ICU4C.

This is not the case for all of this stuff. It is the case for akshara segmentation and I would argue Greek casemapping as well (bear in mind: ICU4C's Greek casemapping behavior is a divergence from the spec!). There are absolutely some things where there is One True Answer. It's not always obvious when that is the case.

I do think that's the crux of it: there are different types of i18n correctness. Some of them are absolute and unchanging, whereas some of them are a carefully weighed tradeoff which may come up at any time. I would absolutely be in support of a rule that prevented us from shipping e.g. plural rule support that couldn't handle 5 or 6 plural forms, or decimal formatting support that doesn't handle systems where the separators are not evenly spaced apart. But I would be much more reluctant for us to do this for akshara segmentation or Greek casing, which are pretty nuanced issues with multiple potential positions. (I would still very much be in support of us documenting our lack of support for those.) Because fundamentally, the latter kind can crop up any time, and we are not guaranteed to know all or even most of them before stabilization.
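
To make the "separators are not evenly spaced" case concrete: in the Indian numbering system, the group nearest the decimal point has three digits and every group to its left has two. The following is a minimal standalone Rust sketch of that grouping rule, not ICU4X's implementation; `group_digits` and its `primary`/`secondary` parameters are hypothetical names introduced only for this illustration.

```rust
/// Inserts `sep` into a string of integer digits.
/// `primary` is the size of the group nearest the decimal point;
/// `secondary` is the size of every group to the left of it.
/// (Hypothetical helper for illustration; not an ICU4X API.)
fn group_digits(digits: &str, primary: usize, secondary: usize, sep: char) -> String {
    assert!(primary > 0 && secondary > 0, "group sizes must be non-zero");
    let chars: Vec<char> = digits.chars().collect();
    let n = chars.len();
    let mut out = String::new();
    for (i, c) in chars.iter().enumerate() {
        // Number of digits from this one (inclusive) to the end.
        let remaining = n - i;
        // A separator goes before this digit when the digits to its right
        // form exactly the primary group plus whole secondary groups.
        if i > 0 && remaining >= primary && (remaining - primary) % secondary == 0 {
            out.push(sep);
        }
        out.push(*c);
    }
    out
}

fn main() {
    // Evenly spaced groups of three.
    println!("{}", group_digits("1234567", 3, 3, ','));
    // Indian-style grouping: three digits, then groups of two.
    println!("{}", group_digits("1234567", 3, 2, ','));
}
```

A formatter that hard-codes a single group size cannot express the second call at all, which is the kind of absolute gap the rule would be meant to catch.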


(I picked browsers as an example because it's a case where the Web has collectively decided that ICU4C's behavior is standard. This is de facto, not de jure, since the spec mostly says "do what the Unicode standard does, but you can also do extra per-language stuff".)

sffc commented Jun 22, 2023

#3397?

😆

I'll pick one of the two PRs and pull the best parts of each one into the other

sffc commented Jun 28, 2023

Closing in favor of #3397

@sffc sffc closed this Jun 28, 2023
Development

Successfully merging this pull request may close these issues.

Write down the checklist for stabilizing a component
3 participants