Accept and resolve BCP47 language names (WIP) #1641
base: master
Conversation
```lua
SILE.settings:set("document.language", options.language)
fluent:set_locale(options.language)
-- BEGIN OMIKHLEIA HACKLANG
-- Commented out. This is BAD design:
```
As you note, fluent-lua does maintain state: the current locale, and also the messages loaded in each locale. Since SILE can and does flip back and forth between locales rather than just setting one and sticking to it, the only way the Fluent API should be used from SILE at this point is to always set the locale before every message operation. That is being done everywhere else and should be done here too. It should never be used to read or write a value without explicitly setting a locale first.
(I don't like state-maintaining APIs, and think it could have been avoided.)

Notwithstanding, it doesn't have to be done here (in `\font`); that's the point of this commenting out:

- `SILE.languageSupport.loadLanguage()` does it anyway when it needs to load messages.
- The `\fluent` command does it when invoked for reading and rendering a message.
A longer-term proper usage here would be to instantiate and embed the correct Fluent messages class inside languageSupport when a language is loaded. The only "state" we're actually making use of is a toggle for which locale to access; the right way is to have each SILE language support module hold its own instance with the correlated locale.
```lua
-- The user may have set document.language to anything, let's ensure a canonical
-- BCP47 language...
if language ~= "und" then
  language = icu.canonicalize_language(language)
```
Does this actually parse and match the closest canonical language, or does it just validate the form (see these comments)? If the latter, I would propose we actually need the former, and the right place to add that functionality would be the cldr-lua library, not SILE.
It does what is implemented in the SILE wrapper, i.e. several things:

- Converts (possibly poorly formatted) BCP47 to the correct Locale format.
  - This does not perform any kind of validation. E.g. `ukuou` remains `ukuou` (it's not forbidden by any standard, actually, to design "new language names"); `en-Cyrl` becomes `en_Cyrl` (well, English in Cyrillic script might not really exist, but likewise, why not).
  - It takes care of normalizing the format (capitals, etc.). E.g. `en-us` becomes `en_US`.
  - Unrecognized elements are skipped.
  - Note that it actually has some provision for scripts, variants, collations, etc. `fr-u-ks-level2-kn` becomes `fr@colnumeric=yes;colstrength=secondary`... I have not taken advantage of this in the collation-enabled sorting, but we could possibly... (there's an interesting question there, but it's a slightly different topic).
- Then minimizes it according to the "canonical" rules for subtags, e.g. (refer to ICU for more details, these are just examples):
  - `en_US` becomes `en` (that's the default implication), while `en_GB` remains itself. Likewise `fr_FR` becomes `fr` (but obviously `fr_CA` etc. stay).
  - `sr_Cyrl` becomes `sr` (again, the default implication is the Cyrillic script), while `sr_Latn` stays, etc. Likewise `zh_Hans` = `zh` (the norm, again), and `zh-Hant` stays itself.
- Converts back to BCP47.
  - So we are back with a well-formed, minimized or "canonical" BCP47.

In the end, to keep the same kind of examples:

- `en-us` => `en`
- `en-gb` = `en-GB`
- `sr-Cyrl` => `sr`
- `sr-Latn` => `sr-Latn`
- `fr-u-ks-level2-kn` = `fr-u-kn-true-ks-level2` (yep, it also "normalizes" the optional ICU BCP47 extensions)

All of these are "valid".
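To make the round trip concrete, here is a minimal sketch. It assumes the `icu.canonicalize_language` wrapper from this PR (loading it from a `justenoughicu` module is my assumption); the expected values merely restate the examples above:

```lua
-- Sketch only: assumes the icu.canonicalize_language wrapper added in this
-- PR, loaded from SILE's ICU glue (the module name is an assumption).
local icu = require("justenoughicu")

-- BCP47 in, minimized "canonical" BCP47 out (examples from the discussion):
print(icu.canonicalize_language("en-us"))   -- en      (default region implied)
print(icu.canonicalize_language("en-gb"))   -- en-GB   (normalized, but kept)
print(icu.canonicalize_language("sr-Cyrl")) -- sr      (Cyrillic is the default script)
print(icu.canonicalize_language("sr-Latn")) -- sr-Latn (non-default script is kept)
```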
What you are asking is something else, the presumed validity of a known language... Well, it doesn't do that.

- `ukuo-Kuou` passes (it would be the "Ukuo" language in the "Kuou" script; I just invented it).
- `en-Cyrl` passes, as stated (it would be the English language in Cyrillic script. Perhaps someone even invented it one day).

But do we really have to care about non-existing combinations?

- They wouldn't have any hyphenation rules anyway.
- They wouldn't have any i18n translation keys... Erm, wait, until someone decides to provide them ;-)

Hopefully maybe... `tlh-Piqd` passes... that's Klingon in the pIqaD script. Yep, it does exist "officially", being even registered in ISO 15924.

This being said, if you want to filter "invalid" languages in cldr-lua, you have huge tables to maintain:

- All scripts from ISO 15924: https://www.unicode.org/iso15924/iso15924-codes.html
- Lots of languages, regional variants, etc.: https://icu4c-demos.unicode.org/icu-bin/locexp

Currently cldr-lua is aligned on CLDR 36 from 2019. We are in 2022; it should use CLDR 42 at least...
The discussion on ICU "canonicalize" also relates to #276 (when it was implemented).
Thanks, that helps a lot!

> But do we have to really care about non-existing combinations?

No. Nor do we care about invalid languages. By the same token we don't care about valid languages if we don't have any support for them. What we want is to fall back to the closest thing we support in a given context, and that might not be the same for all contexts. A font might support a more specific locale than the one we have hyphenation rules or localized strings for.
```lua
-- BEGIN OMIKHLEIA HACKLANG
-- Disabled for now, see further below.
-- local cldr = require("cldr")
```
See comment below: if ICU actually does what we need here, that's fine; otherwise cldr-lua should be extended to provide the necessary functionality.
```lua
if res then
  return res, langbcp47
end
langbcp47 = langbcp47:match("^(.+)-.*$") -- split at dash (-) and remove last part.
```
I think this matching/parsing bit needs to be in cldr-lua.
Wherever it eventually ends up, the utility is also useful for other packages. E.g. I wish I had it for my "smartquotes" package.
The ICU bits should probably get moved from the font code to some base language support module, and yes, should be set up in a way that every package can use them. This bit, which isn't really ICU, still looks to me like it belongs in CLDR, and it could also return information about what the segments mean.
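As a self-contained illustration (not the PR's actual helper), the pattern in the diff above can be iterated to expand a tag into its specific-to-general fallback chain, which is the shape of utility both cldr-lua and packages like "smartquotes" could share:

```lua
-- Build the specific-to-general fallback chain for a BCP47 tag by repeatedly
-- stripping the last dash-separated subtag, as the pattern in the diff does.
local function fallbackChain (langbcp47)
  local chain = { langbcp47 }
  while true do
    local parent = langbcp47:match("^(.+)-.*$") -- drop the last subtag
    if not parent then break end
    chain[#chain + 1] = parent
    langbcp47 = parent
  end
  return chain
end

-- fallbackChain("sr-Latn-RS") --> { "sr-Latn-RS", "sr-Latn", "sr" }
```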
```lua
-- We need to find language resources for this BCP47 identifier, from the less specific
-- to the more general.
local langresource, matchedlang = forLanguage(language, function (lang)
```
This whole specific-to-general fallback parsing is something CLDR should take care of. The only catch is that we need to track TWO sets of available resources (SILE Lua libraries and Fluent messages), take a specific request, and fall back to the first available general case for both (either in sync, or the first available for each).
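One hedged sketch of that shape (all names hypothetical, not the PR's API): resolve a request against each resource set independently, walking the same specific-to-general list of candidates:

```lua
-- Hypothetical sketch: given a specific-to-general list of candidate tags,
-- return the first resource a loader can provide, plus the tag that matched.
-- One such call per resource set (SILE Lua libraries, Fluent messages) lets
-- each set fall back independently; comparing the matched tags afterwards
-- would let us keep the two sets in sync instead.
local function resolveResource (candidates, loader)
  for _, tag in ipairs(candidates) do
    local resource = loader(tag)
    if resource then return resource, tag end
  end
  return nil
end
```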
I'd perhaps suggest splitting hyphenation, node breaking and number formatting as 3 separate resources.
Why not just methods on a class? If a language doesn't provide a method it would fall back to the parent class.
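A minimal Lua sketch of that idea (all names hypothetical): with metatable-based inheritance, a language that doesn't override a method automatically falls back to its parent's implementation:

```lua
-- Hypothetical sketch: per-language classes where an unimplemented method
-- falls back to the parent language via the metatable chain.
local function newLanguage (parent)
  local lang = setmetatable({}, parent and { __index = parent } or nil)
  lang.__index = lang
  return lang
end

local en = newLanguage()
function en.hyphenate (word) return { word } end -- stub implementation

local en_GB = newLanguage(en)
-- en_GB defines no hyphenate of its own, so en's version is used:
-- en_GB.hyphenate("colour") --> { "colour" }
```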
Hyphenation at least could be handled differently: it's hard to reconstruct the patterns from the TeX sources (or whatever other sources) and update them in the middle of the rest of the code. Ideally we should be able to script the conversion of the patterns to a Lua table (I guess that's what was done initially), crediting the original sources, etc., and keep our node breaking / number formatting / etc. apart. That would also make maintenance easier.
(But otherwise, class(es) and inheritance would be better indeed than the current logic.)
I think we're talking about different things. I was thinking of the API structure not the file layout. If hyphenation data comes from an external source I think it would be totally fine to keep the raw data in a dedicated file and require it up from wherever we put our code.
I was embarrassed by the current file layout too, indeed. But if we are heading towards one class per file, we have the best of all worlds ;-)
```lua
-- Most language resource files act by side effects, directly tweaking
-- SILE.nodeMarkers.xx, SILE.hyphenator.languages.xx, SU.formatNumber.xx (etc.? Doh)
-- BUT some don't do that exactly AND return a table with an init method...
-- Unclear API discrepancy, heh.
```
Yes, the ones that act by side effect predate my attempts to bring some modularity to the interfaces. We need to switch to a module system that can subclass other modules when desired. We've done it for document classes, packages, inputters, outputters, typesetters and more, but not yet for languages. We need to.
Some "food for thought" regarding #1631.

I fancied checking what adding `\language[main=en-GB]`¹ to the SILE manual would imply... Regard this as a (kind of working, but ugly) draft attempt.

It raises some interesting questions for the future (`font.script`, remainders of the no-longer properly initialized `SILE.languageSupport.languages`, etc.).

How to move forward certainly requires a longer and tougher discussion.
Footnotes

1. Or anything else you'd like. I played with "sr-Latn", which currently resolves to "sr" (= sr-Cyrl, actually): no real interest obviously, except to see chapter titles in Cyrillic and hyphenation not working for the English text :) ↩