Accept and resolve BCP47 language names (WIP) #1641
base: master
Conversation
```lua
SILE.settings:set("document.language", options.language)
fluent:set_locale(options.language)
-- BEGIN OMIKHLEIA HACKLANG
-- Commented out. This is BAD design:
```
As you note, fluent-lua does maintain state: the current locale, and also the messages loaded in each locale. Since SILE can and does flip back and forth between locales rather than just setting one and sticking to it, the only way the Fluent API should be used from SILE at this point is to always set the locale before every message operation. That is being done everywhere else and should be done here too. It should never be used to read or write a value without explicitly setting a locale first.
(I don't like state-maintaining APIs, and think it could have been avoided.)

Notwithstanding, it doesn't have to be done here (in `\font`); that's the point of this commenting out:

- `SILE.languageSupport.loadLanguage()` does it anyway when it needs to load messages.
- The `\fluent` command does it when invoked for reading and rendering a message.
A longer-term proper usage here would be to instantiate and embed the correct Fluent messages class inside languageSupport when a language is loaded. The only "state" we're actually making use of is a toggle for which locale to access; the right way is to have each SILE language support module hold its own instance with the correlated locale.
```lua
-- The user may have set document.language to anything, let's ensure a canonical
-- BCP47 language...
if language ~= "und" then
  language = icu.canonicalize_language(language)
```
Does this actually parse and match the closest canonical language, or does it just validate the form (see these comments)? If the latter, I would propose we actually need the former, and the right place to add that functionality would be the cldr-lua library, not SILE.
It does what is implemented in the SILE wrapper, i.e. several things:

- Converts (possibly poorly formatted) BCP47 to the correct Locale format.
  - This does not perform any kind of validation. E.g. `ukuou` remains `ukuou` (it's not forbidden by any standard, actually, to design "new language names"); `en-Cyrl` becomes `en_Cyrl` (well, English in Cyrillic script might not really exist, but likewise, why not).
  - It takes care of normalizing the format (capitals, etc.). E.g. `en-us` becomes `en_US`.
  - Unrecognized elements are skipped.
  - Note that it actually has some provision for scripts, variants, collations, etc. `fr-u-ks-level2-kn` becomes `fr@colnumeric=yes;colstrength=secondary`... I have not taken advantage of this in the collation-enabled sorting, but we could possibly... (there's an interesting question there, but it's a slightly different topic).
- Then minimizes it according to the "canonical" rules for subtags, e.g. (refer to ICU for more details, these are just examples):
  - `en_US` becomes `en` (that's the default implication), while `en_GB` remains itself. Likewise `fr_FR` becomes `fr` (but obviously `fr_CA` etc. stay).
  - `sr_Cyrl` becomes `sr` (again, the default implication is the Cyrillic script), while `sr_Latn` stays, etc. Likewise `zh_Hans` = `zh` (the norm, again), and `zh-Hant` stays itself.
- Converts back to BCP47.
  - So we are back with a well-formed, minimized or "canonical" BCP47.

In the end, to keep the same kind of examples:

- `en-us` => `en`
- `en-gb` = `en-GB`
- `sr-Cyrl` => `sr`
- `sr-Latn` => `sr-Latn`
- `fr-u-ks-level2-kn` = `fr-u-kn-true-ks-level2` (yep, it also "normalizes" the optional ICU BCP47 extensions)

All of these are "valid".
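To make the round trip concrete, here is a minimal sketch. It assumes the `icu.canonicalize_language` wrapper from this PR (loading it from a `justenoughicu` module is my assumption); the expected values merely restate the examples above:

```lua
-- Sketch only: assumes the icu.canonicalize_language wrapper added in this
-- PR, loaded from SILE's ICU glue (the module name is an assumption).
local icu = require("justenoughicu")

-- BCP47 in, minimized "canonical" BCP47 out (examples from the discussion):
print(icu.canonicalize_language("en-us"))   -- en      (default region implied)
print(icu.canonicalize_language("en-gb"))   -- en-GB   (normalized, but kept)
print(icu.canonicalize_language("sr-Cyrl")) -- sr      (Cyrillic is the default script)
print(icu.canonicalize_language("sr-Latn")) -- sr-Latn (non-default script is kept)
```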
What you are asking is something else, the presumed validity of a known language... Well, it doesn't do that.

- `ukuo-Kuou` passes (it would be the "Ukuo" language in the "Kuou" script; I just invented it).
- `en-Cyrl` passes, as stated (it would be the English language in Cyrillic script. Perhaps someone even invented it one day).

But do we really have to care about non-existing combinations?

- They wouldn't have any hyphenation rules anyway.
- They wouldn't have any i18n translation keys... Erm, wait, until someone decides to provide them ;-)

Hopefully maybe... `tlh-Piqd` passes... that's Klingon in the pIqaD script. Yep, it does exist "officially", being even registered in ISO 15924.

This being said, if you want to filter "invalid" languages in cldr-lua, you have huge tables to maintain:

- All scripts from ISO 15924: https://www.unicode.org/iso15924/iso15924-codes.html
- Lots of languages, regional variants, etc.: https://icu4c-demos.unicode.org/icu-bin/locexp

Currently cldr-lua is aligned on CLDR 36 from 2019. We are in 2022; it should use CLDR 42 at least...
The discussion on ICU "canonicalize" also relates to #276 (when it was implemented).
Thanks, that helps a lot!

> But do we have to really care about non-existing combinations?

No. Nor do we care about invalid languages. By the same token we don't care about valid languages if we don't have any support for them. What we want is to fall back to the closest thing we support in a given context, and that might not be the same for all contexts. A font might support a more specific locale than the one we have hyphenation rules or localized strings for.
```lua
-- BEGIN OMIKHLEIA HACKLANG
-- Disabled for now, see further below.
-- local cldr = require("cldr")
```
See comment below: if ICU actually does what we need here, that's fine; otherwise cldr-lua should be extended to provide the necessary functionality.
```lua
if res then
  return res, langbcp47
end
langbcp47 = langbcp47:match("^(.+)-.*$") -- split at dash (-) and remove last part.
```
I think this matching/parsing bit needs to be in cldr-lua.
Wherever it eventually ends up, the utility is also useful for other packages. E.g. I wish I had it for my "smartquotes" package.
The ICU bits should probably get moved from the font code to some base language support module, and yes, should be set up in a way that every package can use them. This bit, which isn't really ICU, still looks to me like it belongs in CLDR, and it could also return information about what the segments mean.
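As a self-contained illustration (not the PR's actual helper), the pattern in the diff above can be iterated to expand a tag into its specific-to-general fallback chain, which is the shape of utility both cldr-lua and packages like "smartquotes" could share:

```lua
-- Build the specific-to-general fallback chain for a BCP47 tag by repeatedly
-- stripping the last dash-separated subtag, as the pattern in the diff does.
local function fallbackChain (langbcp47)
  local chain = { langbcp47 }
  while true do
    local parent = langbcp47:match("^(.+)-.*$") -- drop the last subtag
    if not parent then break end
    chain[#chain + 1] = parent
    langbcp47 = parent
  end
  return chain
end

-- fallbackChain("sr-Latn-RS") --> { "sr-Latn-RS", "sr-Latn", "sr" }
```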
```lua
-- We need to find language resources for this BCP47 identifier, from the less specific
-- to the more general.
local langresource, matchedlang = forLanguage(language, function (lang)
```
This whole specific-to-general fallback parsing is something CLDR should take care of. The only catch is that we need to track TWO sets of available resources (SILE Lua libraries and Fluent messages), take a specific request, and fall back to the first available general case for both (either in sync, or the first available for each).
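One hedged sketch of that shape (all names hypothetical, not the PR's API): resolve a request against each resource set independently, walking the same specific-to-general list of candidates:

```lua
-- Hypothetical sketch: given a specific-to-general list of candidate tags,
-- return the first resource a loader can provide, plus the tag that matched.
-- One such call per resource set (SILE Lua libraries, Fluent messages) lets
-- each set fall back independently; comparing the matched tags afterwards
-- would let us keep the two sets in sync instead.
local function resolveResource (candidates, loader)
  for _, tag in ipairs(candidates) do
    local resource = loader(tag)
    if resource then return resource, tag end
  end
  return nil
end
```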
I'd perhaps suggest splitting hyphenation, node breaking and number formatting as 3 separate resources.
Why not just methods on a class? If a language doesn't provide a method it would fall back to the parent class.
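A minimal Lua sketch of that idea (all names hypothetical): with metatable-based inheritance, a language that doesn't override a method automatically falls back to its parent's implementation:

```lua
-- Hypothetical sketch: per-language classes where an unimplemented method
-- falls back to the parent language via the metatable chain.
local function newLanguage (parent)
  local lang = setmetatable({}, parent and { __index = parent } or nil)
  lang.__index = lang
  return lang
end

local en = newLanguage()
function en.hyphenate (word) return { word } end -- stub implementation

local en_GB = newLanguage(en)
-- en_GB defines no hyphenate of its own, so en's version is used:
-- en_GB.hyphenate("colour") --> { "colour" }
```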
Hyphenation at least could be handled differently: it's hard to reconstruct the patterns from the TeX sources (or whatever other sources) and update them in the middle of the rest of the code. Ideally we should be able to script the conversion of the patterns to a Lua table (I guess that's what was done initially), crediting the original sources, etc., and keep our node breaking / number formatting / etc. apart. That would also make maintenance easier.
(But otherwise, class(es) and inheritance would be better indeed than the current logic.)
I think we're talking about different things. I was thinking of the API structure not the file layout. If hyphenation data comes from an external source I think it would be totally fine to keep the raw data in a dedicated file and require it up from wherever we put our code.
I was embarrassed by the current file layout too, indeed. But if we are heading towards one class per file, we have the best of all worlds ;-)
```lua
-- Most language resource files act by side effects, directly tweaking
-- SILE.nodeMarkers.xx, SILE.hyphenator.languages.xx, SU.formatNumber.xx (etc.? Doh)
-- BUT some don't do that exactly AND return a table with an init method...
-- Unclear API discrepancy, heh.
```
Yes, the ones that act by side effect predate my attempts to bring some modularity to the interfaces. We need to switch to a module system that can subclass other modules when desired. We've done it for document classes, packages, inputters, outputters, typesetters and more, but not yet for languages. We need to.
Some "food for thought" regarding #1631.

I fancied checking what adding `\language[main=en-GB]`¹ to the SILE manual would imply... Regard this as a (kind of working, but ugly) draft attempt.

It raises some interesting questions for the future (`font.script`, remainders of the no-longer properly initialized `SILE.languageSupport.languages`, etc.).

How to move forward certainly requires a longer and tougher discussion.
Footnotes

1. Or anything else you'd like. I played with "sr-Latn", which currently resolves to "sr" (= sr-Cyrl, actually): no real interest obviously, except to see chapter titles in Cyrillic and hyphenation not working for the English text :) ↩