Transliterator IDs with unknown BCP47 IDs #3891
Comments
Do you have an example? I thought the idea with BCP-47 is that it is generalizable enough to represent anything you throw at it.
Why not just
We should check the lookup function first, and then load from the data if we didn't find anything. I don't have a problem when a power user injects a runtime transliterator that is redundant with one from the data file. They are a power user so it's fine to expect them to know how to clean up their datagen to not generate redundant transliterators. Since everything is being handled with aux keys, it should be easy to filter these using the same machinery for filtering other aux keys. |
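The lookup order proposed in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Translit` trait, the `Uppercase` and `FromData` types, the `resolve` function, and the `HashMap` standing in for the data file are all hypothetical names and are not the ICU4X API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a transliterator; not the ICU4X trait.
pub trait Translit {
    fn transliterate(&self, input: &str) -> String;
}

/// Hypothetical runtime-provided implementation (e.g. injected by a power user).
pub struct Uppercase;

impl Translit for Uppercase {
    fn transliterate(&self, input: &str) -> String {
        input.to_uppercase()
    }
}

/// Hypothetical data-backed implementation. A real one would interpret
/// compiled rules; this one just returns the stored payload.
pub struct FromData {
    pub payload: String,
}

impl Translit for FromData {
    fn transliterate(&self, _input: &str) -> String {
        self.payload.clone()
    }
}

/// Resolves `id` by asking the user's lookup function first and only then
/// falling back to the data. A redundant entry in `data` is simply shadowed.
pub fn resolve<F>(
    id: &str,
    lookup: &mut F,
    data: &HashMap<String, String>,
) -> Option<Box<dyn Translit>>
where
    F: FnMut(&str) -> Option<Box<dyn Translit>>,
{
    if let Some(custom) = lookup(id) {
        return Some(custom);
    }
    data.get(id).map(|payload| {
        Box::new(FromData {
            payload: payload.clone(),
        }) as Box<dyn Translit>
    })
}
```

With this ordering, a transliterator that exists both in the data and in the lookup function resolves to the runtime one, which is why the redundant data entry is harmless apart from its size.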
Right, I should have gone into more detail there: The current approach does not do any programmatic conversion from legacy IDs (that are used in rule sources) to BCP47 IDs (that we use in the data struct). The way we can still use the BCP47 IDs is by relying on CLDR metadata and store a mapping using that metadata. BCP47 metadata is stored in the
We do not do any programmatic conversion because the conversion seems a bit underspecified, and legacy IDs need data for full length + 4-length script names ( |
OK. Well I think the main thing we should do is to make any public APIs, including the plugin API, use BCP-47, because otherwise these internal CLDR strings get exposed to clients, which we should try to avoid. For any transforms that don't have BCP-47 equivalents, can you just make up what they should be? There probably aren't more than a few dozen at most. So, like, at runtime, there should not be any legacy IDs found anywhere. I still am not clear on why you think we should keep them around. This is how we deal with IANA time zone names. We use BCP-47 time zone IDs everywhere at runtime, and we export a separate class that converts from IANA to BCP-47 (#3499).
My suggestion in the OP is to have some easily convertible representation for legacy IDs (e.g., x-B-C) that would allow
This works for CLDR ones, like the linked However, we could require the user to provide (at datagen time) BCP47 IDs for all custom transliterators they intend to use, which would also fix the problem. A quick note why programmatic legacy ID => BCP47 ID conversion is nontrivial:
If there's a better definition of the conversion than the one at the very top here: https://unicode.org/reports/tr35/tr35-general.html#Transforms, I haven't found it. I'm not saying it's impossible to programmatically convert, but I'm not sure if we want to invest the upstream work to get clarity on the conversion process right now?
Discussion notes:
Resolution:
LGTM: @sffc @skius @robertbastian @eggrobin
The code part is done, users are only exposed to BCP-47-T
On that subject, the path is to make ICU support the new IDs, and then to migrate CLDR data. See ICU-22474.
Currently, transliterator IDs are represented in DataLocales by putting their -t- ID into the aux key of und (#3765). This works for transliterators that exist at datagen-time and are not overridden by custom code-based impls at runtime.
Issues arise in the following:
A user writes a rule-based transliterator A-C, that recursively calls B-C. B-C does not exist at datagen, because the user wants to provide their ML implementation at runtime.
When compiling A-C, an ID for B-C needs to be stored. Transliterators available at datagen time (e.g., shipped ones in ICU4X [1]) use the mentioned und+...-t- ID... representation with the aux key. B-C does not have this metadata available, because the user does not put this transliterator through the datagen pipeline (B-C is ML), so there is no metadata needed.
How do we refer to B-C in the datastruct for A-C? Keeping in mind that at runtime, the user will need to be able to hook into dataloading using this ID with the lookup function they pass to Transliterator::new.
Assuming we have some non-BCP47 representation for these non-existent-at-datagen-time transliterators, maybe und+x-LegacyID or similar, we run into issues when users want to override an existing (e.g., shipped) ICU4X datagen transliterator. Because there is no strong signal during compilation that a certain transliterator will be overridden, the compiler compiles the BCP47 ID, so at runtime, the lookup function also needs to support the BCP47 ID.
One fix for this is to take "this transform does not exist at datagen time" as a signal for "I want to override it during runtime", i.e., X-Z and Y-Z (which is used by X-Z) are shipped transliterators, the user wants to custom-impl Y-Z, so they have to delete the source files for Y-Z. Now during runtime the lookup function only needs to support one kind of ID, one that's easily derivable from the Legacy ID (e.g., und+x-LegacyID). A downside to this approach is that it's (potentially?) cumbersome for the user, and quite confusing to debug if they forget to delete the shipped Y-Z source file.
(BCP47 <> Legacy ID conversion is not obvious, which is why this is an issue in the first place)
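To make the proposed fix concrete, here is a minimal sketch of how a compiler might pick the ID it stores for a dependency. Everything in it is an assumption for illustration: `private_use_key`, `dependency_key`, and the `known_bcp47` map (legacy ID to BCP47 ID for sources present at datagen time) are hypothetical and do not describe the actual ICU4X datagen code.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds the `und+x-LegacyID`-style key floated above
/// for transliterators that have no BCP47 ID at datagen time.
pub fn private_use_key(legacy_id: &str) -> String {
    format!("und+x-{}", legacy_id)
}

/// Hypothetical helper: chooses the ID stored in the compiled data struct
/// for a dependency. If the source was present at datagen time (so metadata
/// with a BCP47 ID exists), that ID is used. Otherwise the absence is taken
/// as the signal "this will be provided at runtime" and the private-use key
/// derived from the legacy ID is stored instead.
pub fn dependency_key(legacy_id: &str, known_bcp47: &HashMap<String, String>) -> String {
    match known_bcp47.get(legacy_id) {
        Some(bcp47) => bcp47.clone(),
        None => private_use_key(legacy_id),
    }
}
```

Under this sketch, deleting the Y-Z source removes it from `known_bcp47`, so X-Z is compiled against the derived key and the runtime lookup function only ever sees that one form. It also shows the debugging hazard: if the source is left in place, the BCP47 ID is stored silently and the user's override is never consulted under the key they expect.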
Discuss with:
Discuss:
... B-C be compiled? (Transliterators for which there is no BCP47 ID available)
... FnMut(source: &str, target: &str, variant: Option<&str>) -> impl Translit? (aka LegacyID)
... Y-Z? (Overriding shipped transliterators)
Footnotes
[1] visible ones, internal ones have a different representation
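For the LegacyID-shaped lookup signature in the discussion list, a rough sketch of splitting a legacy ID into its parts and handing them to such a callback might look like this. The names `split_legacy_id` and `lookup_by_legacy_id` are hypothetical, the Source-Target/Variant layout and the default source of "Any" are assumptions about legacy IDs, and the callback is generic over its return type because `impl Translit` cannot be written directly in an `FnMut` bound.

```rust
/// Splits a legacy ID of the assumed form "Source-Target/Variant".
/// Assumption: when no source is given, it defaults to "Any".
pub fn split_legacy_id(id: &str) -> (&str, &str, Option<&str>) {
    // Separate the optional "/Variant" suffix first.
    let (pair, variant) = match id.split_once('/') {
        Some((pair, variant)) => (pair, Some(variant)),
        None => (id, None),
    };
    // Then split "Source-Target"; a bare name is treated as the target.
    let (source, target) = pair.split_once('-').unwrap_or(("Any", pair));
    (source, target, variant)
}

/// Calls a user-provided lookup with the (source, target, variant) parts.
/// `T` stands in for whatever transliterator type the lookup produces.
pub fn lookup_by_legacy_id<F, T>(id: &str, lookup: &mut F) -> T
where
    F: FnMut(&str, &str, Option<&str>) -> T,
{
    let (source, target, variant) = split_legacy_id(id);
    lookup(source, target, variant)
}
```

A lookup keyed this way avoids any BCP47 conversion at the call site, at the cost of exposing legacy IDs to users, which is the trade-off debated in the comments.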