ECMA-402 compatibility for the components::Bag #645

Open
9 of 11 tasks
gregtatum opened this issue Apr 13, 2021 · 3 comments
Labels
C-datetime Component: datetime, calendars, time zones S-epic Size: Major project (create smaller child issues) T-task Type: Tracking thread for a non-code task

Comments

@gregtatum

gregtatum commented Apr 13, 2021

Edit:

This issue originally had discussion around the approach for the components bag, but now it serves as the issue to track ECMA-402 compatibility for the components::Bag.

Current scope and status of specified work


Original outdated discussion:

For ICU4X 0.2, we are shipping the first support for the components::Bag. This API is what will back our ECMA-402 compatibility. For Mozilla, we'll need this support in order to validate ICU4X as a good architectural fit. Currently, the partial implementation is probably good enough for some initial validation, but it can't be used for real until more of the scope of work is completed.

Balancing those needs, we have also iterated through various ideas for improving the components bag API that require broader changes to the CLDR data.

Remaining questions on ECMA-402 support

I'm somewhat unclear on what other work remains for complete ECMA-402 support. There are a few fields that are not yet supported. Are non-Gregorian calendar systems required to be compatible here? It would be good to make sure this is well known. I would assume test262 would reveal this.

Improvements to the translations

One of the areas for improvement identified through the work above is a change to the model of how the components bag gets built. Currently, ECMA-402 and ICU4X's components bag work with a per-component toggle and "length" designation. The proposed change, intended to improve the translations, is to provide a per-component toggle plus a "length" per grouping, e.g. date length, time length, time zone length.

The proposed advantages here are that translations can provide a higher fidelity translation for the overall grouping, and that it reduces the total combinatorial logic for the components bag. #605 outlines some thoughts on how to group those components, and arrives at a total of 3075 possible component bag combinations.
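To make the difference between the two models concrete, here is a minimal sketch. All type and field names are hypothetical and chosen for illustration; they are not the actual ICU4X `components::Bag` API, and the component set is deliberately reduced.

```rust
// Hypothetical types for illustration only; this is not the ICU4X API.

/// A display length, attached per component today and per group in the proposal.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Length {
    Short,
    Medium,
    Long,
}

/// Current model (sketch): each component is toggled through `Option`
/// and carries its own length.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct PerComponentBag {
    pub year: Option<Length>,
    pub month: Option<Length>,
    pub day: Option<Length>,
    pub hour: Option<Length>,
    pub minute: Option<Length>,
    pub time_zone: Option<Length>,
}

/// Proposed model (sketch): components are plain toggles and one length
/// applies to each group (date, time, time zone).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GroupedBag {
    pub year: bool,
    pub month: bool,
    pub day: bool,
    pub date_length: Length,
    pub hour: bool,
    pub minute: bool,
    pub time_length: Length,
    pub time_zone: bool,
    pub time_zone_length: Length,
}

impl GroupedBag {
    /// Lower a grouped bag to the per-component model by copying the
    /// group length onto every enabled component of that group.
    pub fn to_per_component(&self) -> PerComponentBag {
        PerComponentBag {
            year: self.year.then_some(self.date_length),
            month: self.month.then_some(self.date_length),
            day: self.day.then_some(self.date_length),
            hour: self.hour.then_some(self.time_length),
            minute: self.minute.then_some(self.time_length),
            time_zone: self.time_zone.then_some(self.time_zone_length),
        }
    }
}
```

In this sketch every grouped bag can be lowered to a per-component bag, but not the other way around: a per-component bag can mix lengths inside one group, which is exactly the freedom the grouped model gives up.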

Currently, the CLDR includes at least roughly 50 combinations per locale. The following mechanisms act as a "compression algorithm" in order to reduce the total number of combinations, and allow for algorithmic expansion of these combinations to a final pattern. These steps are:

a. Glue together date matches and time matches into a single pattern. (completed #617)
b. Expand the length of individual components
c. Use append items for missing components
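
The three steps above can be sketched as small string transforms. This is a toy illustration, not the UTS 35 algorithm or ICU4X code: the function names are hypothetical, quoted literals inside patterns are not handled, and the placeholder conventions (`{1}` for the date and `{0}` for the time in a glue pattern, `{0}` for the base pattern and `{1}` for the appended field in an append item) are assumed to follow CLDR.

```rust
// Toy sketch of steps a, b and c. All function names are hypothetical,
// and quoted literals in patterns are deliberately not handled.

/// Step a: glue a date pattern and a time pattern into a single pattern.
/// Assumes `{1}` stands for the date and `{0}` for the time.
pub fn glue(glue_pattern: &str, date: &str, time: &str) -> String {
    glue_pattern.replace("{1}", date).replace("{0}", time)
}

/// Step b: resize every run of `symbol` in `pattern` to `width` characters,
/// e.g. widening an abbreviated month (`MMM`) to a full month (`MMMM`).
pub fn expand_field(pattern: &str, symbol: char, width: usize) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        if c == symbol {
            // Consume the rest of the run, then emit it at the requested width.
            while chars.peek() == Some(&symbol) {
                chars.next();
            }
            out.extend(std::iter::repeat(symbol).take(width));
        } else {
            out.push(c);
        }
    }
    out
}

/// Step c: append a field that the matched pattern is missing.
/// Assumes `{0}` is the existing pattern and `{1}` is the appended field.
pub fn append_item(append_pattern: &str, base: &str, field: &str) -> String {
    append_pattern.replace("{0}", base).replace("{1}", field)
}
```

For example, `expand_field("MMM d, y", 'M', 4)` yields `MMMM d, y`, `append_item("{0} {1}", "h:mm a", "z")` yields `h:mm a z`, and gluing the two with `"{1}, {0}"` yields `MMMM d, y, h:mm a z`. A real implementation needs additional guards around when a field may be resized.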

Of these steps, steps b and c are somewhat controversial. It turns out that the expansions of lengths have some guard rails, in that skeletons can provide some overrides where these expansions form nonsense translations. In fact, there is an implicit assumption that the skeletons will be expanded, and only the bare minimum of patterns need to be provided, as a form of compression in the CLDR data itself.

Append items are where things get a little tricky. There are some clearly orthogonal components where append items are non-controversial (e.g. time zones). However, the existing UTS 35 specification says only that a result will be computed; it does not guarantee that the result will be high quality. That said, I don't know of specific low quality translations in this area.

Bounding potential component combinations

The CLDR data and ICU4C operate on the assumption that the data can be combined in any way, and then an algorithmic result can be generated. However, there are some ideas on getting around this and completely skipping steps a, b, and c. This is outlined in #605, and works by restricting the inputs to the components bag, and enumerating every combination. As stated in #605, this is tentatively 3075 combinations. In this way, the algorithmic steps (a, b, c) are not needed, as the CLDR data would be completely specified.
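
As a rough illustration of what "restricting the inputs and enumerating every combination" means, here is a toy sketch. The option sets are hypothetical and far smaller than the grouping in #605, so it does not reproduce the 3075 figure; it only shows that once the inputs are bounded, the full set of bags can be listed up front and each one given a fully specified pattern.

```rust
// Hypothetical, deliberately tiny option sets; the grouping in #605 differs.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DateFields {
    YearMonth,
    YearMonthDay,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimeFields {
    HourMinute,
    HourMinuteSecond,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Length {
    Short,
    Long,
}

/// A bag whose inputs are restricted to a closed set of choices per group.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BoundedBag {
    pub date: Option<(DateFields, Length)>,
    pub time: Option<(TimeFields, Length)>,
}

/// All choices for one group: absent, or one field set at one length.
fn group_options<F: Copy>(fields: &[F]) -> Vec<Option<(F, Length)>> {
    let mut out = vec![None];
    for &f in fields {
        for len in [Length::Short, Length::Long] {
            out.push(Some((f, len)));
        }
    }
    out
}

/// Enumerate every bag the restricted API can express.
pub fn all_bags() -> Vec<BoundedBag> {
    let dates = group_options(&[DateFields::YearMonth, DateFields::YearMonthDay]);
    let times = group_options(&[TimeFields::HourMinute, TimeFields::HourMinuteSecond]);
    let mut bags = Vec::new();
    for &date in &dates {
        for &time in &times {
            // Skip the empty bag, which formats nothing.
            if date.is_some() || time.is_some() {
                bags.push(BoundedBag { date, time });
            }
        }
    }
    bags
}
```

With these toy sets each group has 5 choices, giving 5 × 5 − 1 = 24 bags. Because the list is finite and known ahead of time, data for each entry could be specified directly rather than derived through steps a, b, and c.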

In my mind, here are the steps to accomplish this, presented in an un-ordered list.

  • Do the engineering design work to create a new format for the CLDR data that includes all 3075 combinations.
  • Document and specify this work for inclusion in CLDR. (I'm not sure what this means for other consumers of CLDR data, and whether they will require updating the data. It's not clear to me if the availableFormats work goes away, or this is additional data.)
  • In ICU4X, design and write a CLDR transform to adequately compress this data into a reasonable size payload for the ICU4X data providers.
  • Re-work the components bag API to support the above changes.
  • Validate the correctness and improvement to the translations for different use cases, and run through the data-driven testing pipeline.

Incremental approach

Currently, the CLDR proposal deadline is fast approaching, on 2021 May 19. The remaining time is not enough to flesh out a high quality improved model proposal for CLDR and validate it against stakeholders and ECMA-402. There has been a lot of theoretical talk of improvements, without data-driven proofs of concept. The work that has landed in the last week, and the work with Mozilla's integration of ICU4X in Gecko, will provide helpful validation for the changes here.

There's also the idea that you bring design proposals, not design problems to the CLDR group. At this point, I don't feel confident in having a total solution to the combinations generated, and the compression that can be applied to this data. This seems like an active problem still that needs to be designed.

I don't think missing this year's deadline for the CLDR change will be a failure for getting these improvements. Working on the infrastructure incrementally will allow us to have a good set of data-driven validations that we can run through. In addition, with the integration with Gecko, we gain access to the full suite of test262 to ensure spec compatibility.

I believe finishing the UTS 35 skeleton matching algorithm (points a, b, c above) will unblock the current work, and is a minimal investment of work for higher quality results in the short term. This will buy us time to figure out exactly how to define the CLDR data, and validate that it is getting good results. This mitigates the risk of proposing a large change for the translators of the CLDR that is only theoretically going to provide an improvement. It would be good to validate these changes with our current system, and with our partners and users of ICU4X.

Proposal for 1.1

In my mind, this is positioned well to align with the 1.1 or 1.2 release of ICU4X, as we will have a good set of features completed. The design can be an ongoing process that does not block other priority work. These changes will be a good quality boost, and are worth pursuing. The changes can be done internally, while maintaining ECMA-402 support. Once completed, the new API can be swapped in.

@gregtatum gregtatum added discuss Discuss at a future ICU4X-SC meeting C-datetime Component: datetime, calendars, time zones S-epic Size: Major project (create smaller child issues) labels Apr 13, 2021
@sffc sffc added T-task Type: Tracking thread for a non-code task and removed discuss Discuss at a future ICU4X-SC meeting labels May 13, 2021
@sffc sffc added this to the 2021 Q2-m3 milestone May 13, 2021
@sffc

sffc commented May 13, 2021

Assigning this to @gregtatum to investigate. When ready to discuss further, re-add the "discuss" label.

@sffc sffc modified the milestones: 2021 Q3-m1, ICU4X 0.4 Aug 12, 2021
@sffc

sffc commented Oct 21, 2021

@gregtatum to file issues for remaining issues and then close this one.

@gregtatum gregtatum changed the title ECMA-402 compatibility and improved translations components bag ECMA-402 compatibility for the components::Bag Nov 18, 2021
@gregtatum gregtatum removed their assignment Nov 18, 2021
@sffc sffc modified the milestones: 2021 Q4 0.5 Sprint B, ICU4X 0.6 Nov 18, 2021
@gregtatum

I updated this bug to be the official ECMA-402 support for components::Bag, since it already enumerated the issues left.

@sffc sffc modified the milestones: ICU4X 0.6, ICU4X 1.1 May 25, 2022