Accept and resolve BCP47 language names (WIP) #1641

Draft
wants to merge 1 commit into
base: master
from

Omikhleia
Member

@Omikhleia Omikhleia commented Dec 2, 2022

Some "food for thought" regarding #1631

I fancied checking what adding \language[main=en-GB] (see footnote 1) to the SILE manual would imply...

Regard this as a (kind of working but ugly) draft attempt.

It raises some interesting questions for the future:

  • Technical debt and/or code smell (apparently unused setting font.script, remnants of the no-longer properly initialized SILE.languageSupport.languages, etc.)
  • Architecture design and internal API consistency (too many things with too many side effects stuffed into the language resource files, the hard-to-manage state-oriented API of fluent, uncertain need for cldr in the current implementation, some language files returning a table while most do not, etc.)

How to move forward certainly requires a longer and tougher discussion.

Footnotes

  1. Or anything else you'd like. I played with "sr-Latn", which currently resolves to "sr" (= sr-Cyrl actually), no real interest obviously, except to see chapter titles in Cyrillic and hyphenations not working for the English text :)

@Omikhleia Omikhleia marked this pull request as draft December 2, 2022 01:09
SILE.settings:set("document.language", options.language)
fluent:set_locale(options.language)
-- BEGIN OMIKHLEIA HACKLANG
-- Commented out. This is BAD design:
Member

As you note, Fluent-lua does maintain state: the current locale, and also the messages loaded in each locale. Since SILE can and does flip back and forth between calling things for different locales rather than just setting one and sticking to it, the only way the Fluent API should be used from SILE at this point is to always set the locale before every message operation. That is being done everywhere else and should be done here too. It should never be used to read or write a value without explicitly setting a locale.

Member Author

(I don't like state-maintaining APIs, and think it could have been avoided.)

Notwithstanding, it doesn't have to be done here (in \font), that's the point of this commenting out:

  • SILE.languageSupport.loadLanguage() does it anyway when it needs to load messages.
  • The \fluent command does it when invoked for reading and rendering a message

Member

A proper longer-term usage here would be to instantiate and embed the correct Fluent messages class inside languageSupport when a language is loaded. The only "state" we're actually making use of is a toggle switch for which locale to access; the right way is to have each SILE language support module have its own instance with the correlated locale.
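
A minimal sketch of the "one instance per language" idea described above. `MessageBundle` and `LanguageSupport.load` are hypothetical stand-ins invented for illustration; they are not the fluent-lua API nor the current SILE.languageSupport interface. The point is only that each loaded language owns its messages, so no global locale toggle is needed.

```lua
-- Hypothetical stand-in for a per-locale message store (NOT the fluent-lua API).
local MessageBundle = {}
MessageBundle.__index = MessageBundle

function MessageBundle.new(locale)
  return setmetatable({ locale = locale, messages = {} }, MessageBundle)
end

function MessageBundle:add(id, text)
  self.messages[id] = text
end

function MessageBundle:format(id)
  return self.messages[id]
end

-- Hypothetical language registry: each language embeds its own bundle.
local LanguageSupport = { loaded = {} }

function LanguageSupport.load(locale)
  local lang = LanguageSupport.loaded[locale]
  if not lang then
    lang = { locale = locale, messages = MessageBundle.new(locale) }
    LanguageSupport.loaded[locale] = lang
  end
  return lang
end

local en = LanguageSupport.load("en")
local fr = LanguageSupport.load("fr")
en.messages:add("chapter", "Chapter")
fr.messages:add("chapter", "Chapitre")

-- Reads never depend on which locale was "set" last.
print(en.messages:format("chapter"), fr.messages:format("chapter"))
```

With this shape, flipping between languages is just a matter of holding a different object, and a missing message in one language cannot silently be read from another.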

-- The user may have set document.language to anything, let's ensure a canonical
-- BCP47 language...
if language ~= "und" then
language = icu.canonicalize_language(language)
Member

Does this actually parse and match the closest canonical language, or just validate the form (see these comments)? If the latter, I would propose we actually need the former, and the right place to add that functionality would actually be the cldr-lua library, not SILE.

Member Author

@Omikhleia Omikhleia Dec 2, 2022

It does what is implemented in the SILE wrapper i.e. several things:

  • Convert (possibly poorly formatted) BCP47 to correct Locale format.
    • This does not perform any kind of validation. E.g. ukuou remains ukuou (it's not forbidden by any standard, actually, to design "new language names"); en-Cyrl becomes en_Cyrl (well, English in Cyrillic script might not really exist, but likewise, why not).
    • It takes care of normalizing the format (capitals, etc.). E.g. en-us becomes en_US
    • Unrecognized elements are skipped
    • Note that it actually has some provision for scripts, variants, collations, etc. fr-u-ks-level2-kn becomes fr@colnumeric=yes;colstrength=secondary... I have not taken advantage of this in the collation-enabled sorting, but we possibly could... (there's an interesting question then, but it's a slightly different topic).
  • Then minimizes it according to the "canonical" rules for subtags, e.g. (refer to ICU for more details, these are just examples)
    • en_US becomes en (that's the default implication), while en_GB remains itself.
    • Likewise fr_FR becomes fr (but obviously fr_CA etc. stay)
    • sr_Cyrl becomes sr (again, the default implication is Cyrillic script), while sr_Latn stays, etc.
    • Likewise zh_Hans becomes zh (the norm, again), while zh_Hant stays itself.
  • Convert back to BCP47
    • So we are back with a well formed minimized or "canonical" BCP47

In the end, to keep the same kind of examples:
en-us => en
en-gb => en-GB
sr-Cyrl => sr
sr-Latn => sr-Latn
fr-u-ks-level2-kn => fr-u-kn-true-ks-level2 (yep, it also "normalizes" the optional ICU BCP47 extensions)

All of these are "valid".
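
For illustration only, here is a pure-Lua sketch of just the case-normalization part of the steps listed above (en-us to en-US and the like). It does not call ICU and does not perform the minimization (en-US to en) or the extension reordering; `normalize_case` is a hypothetical helper name, not something in the SILE wrapper.

```lua
-- Hypothetical helper: normalize the letter case of BCP47 subtags.
-- Language lowercase, 4-letter script Titlecase, 2-letter region UPPERCASE,
-- everything after a singleton (e.g. "u", "x") lowercase.
local function normalize_case(tag)
  local parts = {}
  local in_extension = false
  for subtag in tag:gmatch("[^-_]+") do
    local index = #parts + 1
    local lower = subtag:lower()
    if index == 1 or in_extension then
      parts[index] = lower
    elseif #lower == 1 then
      -- A singleton starts an extension or private-use sequence.
      in_extension = true
      parts[index] = lower
    elseif #lower == 4 and lower:match("^%a+$") then
      parts[index] = lower:sub(1, 1):upper() .. lower:sub(2)
    elseif #lower == 2 and lower:match("^%a+$") then
      parts[index] = lower:upper()
    else
      parts[index] = lower
    end
  end
  return table.concat(parts, "-")
end

print(normalize_case("en-us"))      -- en-US
print(normalize_case("SR-latn"))    -- sr-Latn
print(normalize_case("zh_hant_tw")) -- zh-Hant-TW
```

Like the ICU-based conversion, this accepts invented tags such as ukuo-Kuou without complaint: it only fixes the form.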

What you are asking is something else, the presumed validity of a known language... Well, it doesn't do that.
ukuo-Kuou passes (it would be the "Ukuo" language in "Kuou" script. I just invented it)
en-Cyrl passes, as stated (it would be the English language in Cyrillic script. Perhaps someone even invented it one day).

But do we have to really care about non-existing combinations?

  • They wouldn't have any hyphenation rules anyway
  • They wouldn't have any i18n translation keys... Erm, wait, until someone decides to provide them ;-)

Hopefully maybe...
tlh-Piqd passes... that's Klingon in the pIqaD script. Yep, it does exist "officially", being even registered in ISO 15924.

This being said, if you want to filter "invalid" languages in cldr-lua, you have huge tables to maintain.

Currently cldr-lua is aligned on CLDR 36 from 2019? We are in 2022, use CLDR 42 at least...

Member Author

The discussion on ICU "canonicalize" also relates to #276 (when it was implemented).

Member

Thanks, that helps a lot!

"But do we have to really care about non-existing combinations?"

No. Nor do we care about invalid languages. By the same token we don't care about valid languages if we don't have any support for them. What we want is to fall back to the closest thing we support in a given context, and that might not be the same for all contexts. A font might support a more specific locale than the ones we have distinct hyphenation rules or localized strings for.


-- BEGIN OMIKHLEIA HACKLANG
-- Disabled for now, see further below.
-- local cldr = require("cldr")
Member

See comment below: if ICU actually does what we need here, that's fine; otherwise cldr-lua should be extended to provide the necessary functionality.

if res then
return res, langbcp47
end
langbcp47 = langbcp47:match("^(.+)-.*$") -- split at dash (-) and remove last part.
Member

I think this matching/parsing bit needs to be in cldr-lua.

Member Author

Wherever it eventually ends, the utility is also useful for other packages. E.g. I wish I had it for my "smartquotes" package.

Member

The ICU bits should probably get moved from the font code to some base language support module, and yes, they should be set up in a way that every package can use them. This bit, which isn't really ICU, still looks like it goes in CLDR to me, and it could also return information about what the segments mean.

Comment on lines +71 to +73
-- We need to find language resources for this BCP47 identifier, from the most specific
-- to the most general.
local langresource, matchedlang = forLanguage(language, function (lang)
Member

This whole specific -> general fallback parsing is something CLDR should take care of. The only catch is that we need to track TWO sets of available resources (SILE Lua libraries and Fluent messages), take a specific request, and fall back to the first available general case for both (either in sync or first available for each).
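
A rough sketch of the "first available for each" variant of that fallback, assuming the same strip-the-last-subtag rule as the pattern in the diff. `fallback_chain`, `resolve`, and the two availability tables are hypothetical names and data made up for the example; neither cldr-lua nor SILE is claimed to expose them.

```lua
-- Build the chain of candidates, most specific first,
-- by repeatedly removing the last dash-separated subtag.
local function fallback_chain(tag)
  local chain = {}
  while tag do
    chain[#chain + 1] = tag
    tag = tag:match("^(.+)-.*$")
  end
  return chain
end

-- Return the first candidate present in a set of available resources.
local function resolve(tag, available)
  for _, candidate in ipairs(fallback_chain(tag)) do
    if available[candidate] then
      return candidate
    end
  end
  return nil
end

-- Hypothetical availability sets, for illustration only.
local lua_resources = { en = true, sr = true }
local fluent_messages = { en = true, ["sr-Latn"] = true, sr = true }

local request = "sr-Latn-RS"
print(resolve(request, lua_resources))   -- sr
print(resolve(request, fluent_messages)) -- sr-Latn
```

Resolving each resource set independently lets localized strings be more specific than hyphenation rules; the "in sync" variant would instead require a candidate to be present in both tables before accepting it.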

Member Author

I'd perhaps suggest splitting hyphenation, node breaking and number formatting as 3 separate resources.

Member

Why not just methods on a class? If a language doesn't provide a method it would fall back to the parent class.
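
One way this could look, sketched with plain Lua metatables rather than whatever class system SILE would actually use. `BaseLanguage`, `French`, and their methods are hypothetical names chosen for the example, and the method bodies are placeholders, not real hyphenation or number formatting.

```lua
-- Hypothetical base class: default behaviour for every language.
local BaseLanguage = {}
BaseLanguage.__index = BaseLanguage

function BaseLanguage.new(tag)
  return setmetatable({ tag = tag }, BaseLanguage)
end

function BaseLanguage:hyphenate(word)
  return { word } -- placeholder: no break points by default
end

function BaseLanguage:formatNumber(n)
  return tostring(n)
end

-- Hypothetical subclass: overrides only what differs.
local French = setmetatable({}, { __index = BaseLanguage })
French.__index = French

function French.new(tag)
  return setmetatable({ tag = tag or "fr" }, French)
end

function French:formatNumber(n)
  -- Placeholder rule: decimal comma instead of decimal point.
  return (tostring(n):gsub("%.", ","))
end

local fr = French.new()
print(fr:formatNumber(1.5))  -- overridden in French
print(fr:hyphenate("mot")[1]) -- inherited from BaseLanguage
```

A lookup that misses on `French` falls through to `BaseLanguage` via `__index`, which is the "fall back to the parent class" behaviour described above.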

Member Author

Hyphenations at least could be handled differently: it's hard to reconstruct them from the TeX sources (or whatever other sources) and update them in the middle of the rest of the code. Ideally we should be able to script the conversion of the patterns to a Lua table (I guess that's what was done initially), crediting the original sources, etc., and keep our node breaking / number formatting / etc. apart. That'd also make maintenance easier.

Member Author

(But otherwise, class(es) and inheritance would be better indeed than the current logic.)

Member

I think we're talking about different things. I was thinking of the API structure not the file layout. If hyphenation data comes from an external source I think it would be totally fine to keep the raw data in a dedicated file and require it up from wherever we put our code.

Member Author

I was embarrassed by the current file layout too, indeed. But if we are heading towards one class per file, we have the best of all worlds ;-)

-- Most language resource files act by side effects, directly tweaking
-- SILE.nodeMarkers.xx, SILE.hyphenator.languages.xx, SU.formatNumber.xx (etc.? Doh)
-- BUT some don't do that exactly AND return a table with an init method...
-- Unclear API discrepancy, heh.
Member

Yes, the ones that act by side effect predate my attempts to bring some modularity to the interfaces. We need to switch to a module system that can subclass other modules when desired. We've done it for document classes, packages, inputters, outputters, typesetters and more: but not yet languages. We need to.

2 participants