Use auxiliary keys instead of fixed locales for segmenter #4511
Comments
Please propose:
Note: adding aux keys is nice because then we can introduce fallback between different models, but then we need to define how that fallback works.
This would basically remove these mappings:
icu4x/provider/datagen/src/transform/segmenter/dictionary.rs, lines 19 to 39 in 9491b63
icu4x/provider/datagen/src/transform/segmenter/lstm.rs, lines 184 to 202 in 9491b63
We would have these combinations
If we support more models in the future, the segmenter constructor might do its own fallback between these.
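A minimal sketch of what constructor-side fallback between models could look like, assuming payloads are looked up by model name (the auxiliary key) rather than by locale. Everything here is hypothetical and for illustration only: `ModelPayload`, `ModelStore`, `load_first`, and the model names `thailstm` and `thaidict` are not ICU4X APIs or real key names.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a segmenter data payload.
#[derive(Debug, Clone, PartialEq)]
struct ModelPayload {
    description: &'static str,
}

/// Hypothetical data store keyed by model name (auxiliary key), not by locale.
struct ModelStore {
    by_aux_key: BTreeMap<&'static str, ModelPayload>,
}

impl ModelStore {
    /// Looks up a single model by its auxiliary key. No locale fallback happens here.
    fn load(&self, aux_key: &str) -> Option<&ModelPayload> {
        self.by_aux_key.get(aux_key)
    }

    /// Tries each candidate model in order and returns the first one present,
    /// together with the key that matched. The order of `candidates` is the
    /// fallback policy, and it is chosen by the caller (the constructor).
    fn load_first<'a>(
        &'a self,
        candidates: &[&'static str],
    ) -> Option<(&'static str, &'a ModelPayload)> {
        candidates
            .iter()
            .find_map(|&key| self.load(key).map(|payload| (key, payload)))
    }
}

/// Builds a store that only contains a (hypothetical) dictionary model.
fn example_store() -> ModelStore {
    let mut by_aux_key = BTreeMap::new();
    by_aux_key.insert(
        "thaidict",
        ModelPayload { description: "hypothetical Thai dictionary model" },
    );
    ModelStore { by_aux_key }
}
```

With this shape, a constructor that prefers an LSTM model but accepts a dictionary model would call `load_first(&["thailstm", "thaidict"])`. The fallback order lives in the constructor, not in the data provider.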
We could shorten
@robertbastian said:
I do think it's ICU4X's job to choose the names that make the most sense for the specific context in which those names are being used. But, if this gets us unblocked, I'm not opposed to adding a short name upstream, similar to how we get short names for properties from icuexportdata.
In that case I think we should stick with the upstream names. Neither the size of the keys nor the time it takes to binary search over them is significant in the context of segmentation data/runtime.
No time has ever been put into figuring out what good names would be. These model names were invented by @SahandFarhoodi and I don't think they were intended to be public-facing names.
In terms of specific issues, the names are not BCP47-compatible, and I think we should try to keep our key attributes BCP47-compatible.
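To make the compatibility concern concrete, here is a rough sketch of a check, assuming "BCP47-compatible" means the name could appear as BCP 47 private-use subtags: hyphen-separated subtags of 1 to 8 ASCII alphanumerics each. The function name and the sample names are hypothetical, and the exact rule ICU4X applies to key attributes may differ.

```rust
/// Returns true if `name` consists of one or more hyphen-separated subtags,
/// each 1 to 8 ASCII alphanumeric characters (the BCP 47 private-use subtag shape).
/// Assumption: this is the relevant notion of "BCP47-compatible" for key attributes.
fn is_bcp47_subtag_compatible(name: &str) -> bool {
    name.split('-').all(|subtag| {
        (1..=8).contains(&subtag.len())
            && subtag.bytes().all(|b| b.is_ascii_alphanumeric())
    })
}
```

Under this rule a short hypothetical name such as `thaidict` passes, while a name containing underscores, or any subtag longer than eight characters, does not.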
Sorry for my late response.
Discussion:
LGTM: @robertbastian @sffc
Segmenter locales are not locales in the usual sense: they do not participate in fallback, and they are not even exposed to the user. Instead of using hardcoded locales to look up segmenter payloads, we should use auxiliary keys.