
Develop one canonical, vetted, reviewed set of character lists per language #89

davelab6 opened this issue Jun 6, 2016 · 18 comments

Member

davelab6 commented Jun 6, 2016

Quoting @behdad

Fontconfig has a minimal character set for each language, and is well-tested.

For font work, we need both a minimal set, as well as a "nice to have" set, which is used when making subsets (ie. include them if they are in the font).

And then separate data for digits, currency signs, rare marks, etc.

Start with CLDR and add missing data there, and fix bugs around it.
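The minimal vs. "nice to have" split quoted above can be expressed as simple set logic. A minimal sketch, assuming placeholder character sets (real data would come from fontconfig or CLDR, not the illustrative sets below):

```python
# Illustrative placeholder sets, not real language data.
MINIMAL = {"a", "b", "c", "é"}       # must all be present to claim support
NICE_TO_HAVE = {"œ", "ǽ", "‰"}       # include in a subset only if the font has them

def supports_language(font_charset):
    """A font 'supports' the language only if every minimal character is present."""
    return MINIMAL <= font_charset

def subset_charset(font_charset):
    """Characters to keep when subsetting: all minimal ones, plus any
    nice-to-have characters the font actually contains."""
    return MINIMAL | (NICE_TO_HAVE & font_charset)
```

The key asymmetry: missing minimal characters disqualify the font, while nice-to-have characters only affect what goes into the subset.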

@ultrasquid

On http://www.cheapprofonts.com/Languages.php there is a chart that may be
a useful reference. It lists characters and corresponding Unicode numbers
needed for many languages that use the Latin alphabet, though there are
some major omissions (notably Vietnamese). Certainly incomplete, but one
must start somewhere.


@behdad

behdad commented Jun 6, 2016

cc @brawer

@behdad

behdad commented Jun 6, 2016

cc @twardoch

@behdad

behdad commented Jun 7, 2016

cc @MrBrezina

@brawer

brawer commented Jun 8, 2016

@ultrasquid, did you know that Unicode CLDR collects which characters are used in what language? For each language, it has four sets of “exemplar characters”. The main set is shown here; the auxiliary set here. Admittedly, the charts could be made a little more readable, but CLDR is a nice central place for collecting this information. If anything is missing or wrong, just file a CLDR ticket. From CLDR, the exemplar character information flows into libraries such as ICU, which is built into many systems. But you can also take the data directly from the source XML files. Search for “exemplar”, for example in the French data.
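For reference, CLDR stores exemplar characters in UnicodeSet notation, e.g. `[a à â æ b c ç {ch}]`, with multi-character sequences in braces. A simplified parser sketch; it deliberately ignores ranges like `a-z` and `\u` escapes that the full UnicodeSet syntax allows:

```python
def parse_exemplar(exemplar):
    """Parse a CLDR exemplarCharacters value such as '[a à â {ch} ç]'
    into a set of strings. Simplified: handles space-separated items and
    {sequences} only, not ranges or escapes."""
    body = exemplar.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = set()
    for token in body.split():
        if token.startswith("{") and token.endswith("}"):
            items.add(token[1:-1])   # multi-character sequence, e.g. a digraph
        else:
            items.add(token)
    return items
```

For production use, ICU's UnicodeSet class implements the full syntax; this sketch is only meant to show the shape of the data.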

@MrBrezina

My two cents. What needs to be established first is a clear definition of the set of characters used by a certain language. I know of at least two approaches, and I am sure an educated linguist would come up with more, and perhaps more precise ones:

  1. official language alphabet
  2. characters that appear (frequently) in texts of the language, i.e., also characters used in family names of foreign origin, etc.

I think that the difference between (2) and (1) is what CLDR calls “auxiliary”, which seems like a good approach.

In terms of what is useful to save for a single language, I thought of this.

<language iso-639-2="?" name="Hunzib" script="Cyrl" status="todo" opentype-tag="?">
    <characters type="required">АБВГДЕӘЖЗИЙКЛМНОПРСТУӮФХЦЧШЪЫЫЬЭӀабвгдеәжзийклмнопрстуӯфхцчшъыыьэӏ</characters>
    <characters type="recommended" note="punctuation">‹›«»…</characters>
    <shaping type="required">
        <feature opentype-tag="mark">
            <bases>АЕӘОЭаеәоэ</bases>
            <marks>̄</marks>
        </feature>
    </shaping>
</language>

Note: this is just a preliminary example; I do not know whether the data is correct. A while ago I thought some indication of the shaping would be useful, as some of the required combinations might not have a precomposed codepoint. But I am not sure whether this is too much. Perhaps just noting that a feature (in this case mark) needs to exist would be sufficient. I am aware this is OpenType-specific, but there is no reason it could not include other formats in the future. Perhaps the TTX format would be better for that.
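If a schema like the sketch above were adopted, it could be consumed with the standard-library XML parser along these lines. Both the sample document and the field names are hypothetical and simply mirror the sketch:

```python
import xml.etree.ElementTree as ET

# Abbreviated sample mirroring the hypothetical <language> schema above.
SAMPLE = """<language iso-639-2="?" name="Hunzib" script="Cyrl" status="todo" opentype-tag="?">
    <characters type="required">АБВ</characters>
    <characters type="recommended" note="punctuation">«»…</characters>
</language>"""

def load_language(xml_text):
    """Read the hypothetical <language> schema into a plain dict,
    keyed by the type attribute of each <characters> element."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "script": root.get("script"),
        "characters": {
            el.get("type"): set(el.text)
            for el in root.findall("characters")
        },
    }
```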

@ultrasquid I found at least one mistake in Czech, so I would be careful about this list.

Useful resources (reliability varies):

  • Unicode CLDR (as @brawer just pointed out), though I hear the data is not in very good shape
  • WebINK Character Sets
  • EasySpeak by Typekit
  • Latin Plus by Underware: http://www.underware.nl/latin_plus/
  • Data on Languages from the Institute of the Estonian Language: http://www.eki.ee/letter
  • at Rosetta we also have a (currently somewhat random) collection of language definitions we would be happy to share or develop further

@MrBrezina

cc: @moyogo

@khaledhosny
Contributor

CLDR is not terribly reliable, at least not for Arabic. Its list of Arabic characters lacks important characters, while rarely used ones are present. See for example w3c/alreq#49.

@brawer

brawer commented Jun 8, 2016

CLDR is not terribly reliable

Putting on my Unicode hat for a sec: Please, please, please report bugs to CLDR so we can fix them.

@moyogo

moyogo commented Jun 8, 2016

CLDR should be the place for information on characters used by locales. A lot of checks can be derived from the characters and character sequences in the exemplars. But in many cases that is not sufficient.

There’s actually more information that font producers would want to be able to refer to when testing the coverage of their fonts. Glyph shape or position variation information is out of the scope of Unicode and the CLDR, yet it is a crucial part of proper locale support. Having a character doesn’t mean a font supports the languages using that character. At the same time some of these requirements are style specific and may not apply to every style. But I digress...

In any case, it might be useful to make a fork of the CLDR character exemplar data, expand and modify it with references and push the fixes upstream.

@davelab6
Member Author

davelab6 commented Jun 8, 2016 via email

@alexeiva

@davelab6 The Cyrillic comparison is something I have developed locally as a fork of Huerta Tipo's projects. Sorry, it still isn't publicly available, as I am extending it and fixing tech issues.

@MrBrezina

MrBrezina commented Jun 22, 2016

With regard to @davelab6's suggestions and @moyogo's comments (sorry if I am stating the obvious here): absolutely agreed that there is more to language support than a list of codepoints. However, part of it has to stay in the domain of type design (appropriating shapes) and type use (using these shapes) for the time being. We do not have tools and methodologies to distinguish the essential from the non-essential in the shapes (think structure vs. style). And if we cannot do that, we cannot say whether a shape complies with expectations or not. Even if we could, it would depend on more variables than just style; it also depends on whom you talk to (e.g. the Polish kreska or Bulgarian Cyrillic discussions). Moreover, preferences keep changing, and any kind of rule gets broken in amazing ways in specific contexts. So there is no way we can tackle language support completely at the moment, I think.

To digress even more, taking Central European languages as an example: there are too many (even award-winning) typefaces which include the right codepoints, even readable shapes you could say, but so badly executed that the great majority of professional Czech designers would be really disappointed if they had to use them.

So what I think we are looking for here is an automated way to diagnose fonts for language support potential based on Unicode codepoints. Nothing more. It is important to be aware of the limits. The question is how do we go about that and where do we draw the line. Personally, I think including some indication of required features is a good idea (so users get a red flag and can go: “Aha, I need something else to be there. I need to research a bit.”), also perhaps some notes. Maybe just the notes. I am not sure if describing the features is all that useful anymore. It adds too much complexity.

See, what I do not know is how to tackle things like accent positions (those not codified in Unicode in precomposed form), e.g. for some ways of writing Yoruba, or conjuncts for Indic languages. Do we just say that particular features need to be present and leave it up to the user to verify whether the support is actually there?
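The "diagnose for potential, red-flag required features" approach discussed above might be sketched like this. Everything here is hypothetical: the language record mirrors the XML sketch earlier in the thread, and in practice the font's codepoints and feature tags would be read from its cmap and GSUB/GPOS tables (e.g. with fontTools):

```python
def diagnose(language, font_codepoints, font_feature_tags):
    """Report language-support *potential* only: codepoint coverage plus a
    red flag for any required shaping feature the font does not expose.
    Whether a present feature actually behaves correctly is left to the user."""
    missing = {c for c in language["required"] if ord(c) not in font_codepoints}
    red_flags = [tag for tag in language.get("required_features", ())
                 if tag not in font_feature_tags]
    return {
        "codepoints_ok": not missing,     # necessary, never sufficient
        "missing_characters": missing,
        "feature_red_flags": red_flags,   # "I need something else to be there"
    }
```

A present `mark` feature still only signals potential; an absent one is a hard red flag the user can research further, exactly in the spirit of the note above.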

@davelab6
Member Author

@graphicore here's the list of languages I'm most interested in:

Afrikaans
Albanian
Arabic
Azerbaijani
Bulgarian
Catalan
Croatian
Czech
Danish
Dutch
Estonian
Filipino
Finnish
French
German
Greek
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Italian
Kazakh
Kyrgyz
Latvian
Lithuanian
Macedonian
Malay
Marathi
Mongolian
Nepali
Norwegian (Bokmål)
Persian
Polish
Portuguese
Portuguese (European)
Romanian
Russian
Serbian
Serbian (Latin)
Slovak
Slovenian
Spanish
Spanish (Latin America)
Swahili
Swedish
Thai
Turkish
Ukrainian
Urdu
Uzbek
Vietnamese

@simoncozens
Contributor

How does this relate to https://github.com/rosettatype/langs-db? Would it be better to "bridge" to Rosetta's YAML file and auto-instantiate charset objects from that?

@behdad

behdad commented Sep 10, 2020

cc @matthiasclasen

@behdad

behdad commented Sep 10, 2020

Hmm. Does Rosetta's repo really not have the issue tracker enabled? @MrBrezina

At any rate, whichever is deemed more canonical, I'd love to merge it with fontconfig's database and make fontconfig generate from it...

@MrBrezina

@behdad I have activated it now. :) We did not consider it quite ready.

btw. we renamed it to Hyperglot today (Langs DB was too general), and @kontur refactored the tool and tests for the new structure of the database. We plan to add more languages in the next few weeks.
