
Develop one canonical, vetted, reviewed set of character lists per language #89

davelab6 opened this issue Jun 6, 2016 · 18 comments

Member

davelab6 commented Jun 6, 2016

Quoting @behdad

Fontconfig has a minimal character set for each language, and is well-tested.

For font work, we need both a minimal set, as well as a "nice to have" set, which is used when making subsets (ie. include them if they are in the font).

And then separate data for digits, currency signs, rare marks, etc.

Start with CLDR and add missing data there, and fix bugs around it.
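The minimal vs. "nice to have" split quoted above can be expressed as simple set logic. A minimal sketch, assuming placeholder character sets (real data would come from fontconfig or CLDR, not the illustrative sets below):

```python
# Illustrative placeholder sets, not real language data.
MINIMAL = {"a", "b", "c", "é"}       # must all be present to claim support
NICE_TO_HAVE = {"œ", "ǽ", "‰"}       # include in a subset only if the font has them

def supports_language(font_charset):
    """A font 'supports' the language only if every minimal character is present."""
    return MINIMAL <= font_charset

def subset_charset(font_charset):
    """Characters to keep when subsetting: all minimal ones, plus any
    nice-to-have characters the font actually contains."""
    return MINIMAL | (NICE_TO_HAVE & font_charset)
```

The key asymmetry: missing minimal characters disqualify the font, while nice-to-have characters only affect what goes into the subset.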

@ultrasquid

On http://www.cheapprofonts.com/Languages.php there is a chart that may be
a useful reference. It lists characters and corresponding Unicode numbers
needed for many languages that use the Latin alphabet, though there are
some major omissions (notably Vietnamese). Certainly incomplete, but one
must start somewhere.


@behdad

behdad commented Jun 6, 2016

cc @brawer

@behdad

behdad commented Jun 6, 2016

cc @twardoch

@behdad

behdad commented Jun 7, 2016

cc @MrBrezina

@brawer

brawer commented Jun 8, 2016

@ultrasquid, did you know that Unicode CLDR collects which characters are used in what language? For each language, it has four sets of “exemplar characters”. The main set is shown here; the auxiliary set here. Admittedly, the charts could be made a little more readable, but CLDR is a nice central place for collecting this information. If anything is missing or wrong, just file a CLDR ticket. From CLDR, the exemplar character information flows into libraries such as ICU, which is built into many systems. But you can also take the data directly from the source XML files. Search for “exemplar”, for example in the French data.
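For reference, CLDR stores exemplar characters in UnicodeSet notation, e.g. `[a à â æ b c ç {ch}]`, with multi-character sequences in braces. A simplified parser sketch; it deliberately ignores ranges like `a-z` and `\u` escapes that the full UnicodeSet syntax allows:

```python
def parse_exemplar(exemplar):
    """Parse a CLDR exemplarCharacters value such as '[a à â {ch} ç]'
    into a set of strings. Simplified: handles space-separated items and
    {sequences} only, not ranges or escapes."""
    body = exemplar.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = set()
    for token in body.split():
        if token.startswith("{") and token.endswith("}"):
            items.add(token[1:-1])   # multi-character sequence, e.g. a digraph
        else:
            items.add(token)
    return items
```

For production use, ICU's UnicodeSet class implements the full syntax; this sketch is only meant to show the shape of the data.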

@MrBrezina

My two cents. What needs to be established first is a clear definition of the set of characters used by a certain language. I know of at least two approaches, and I am sure an educated linguist would come up with more, and perhaps more precise ones:

  1. official language alphabet
  2. characters that appear (frequently) in texts of the language, i.e., also characters used in family names of foreign origin, etc.

I think that the difference between (2) and (1) is what CLDR calls “auxiliary”, which seems like a good approach.

In terms of what is useful to save for a single language, I thought of this.

<language iso-639-2="?" name="Hunzib" script="Cyrl" status="todo" opentype-tag="?">
    <characters type="required">АБВГДЕӘЖЗИЙКЛМНОПРСТУӮФХЦЧШЪЫЫЬЭӀабвгдеәжзийклмнопрстуӯфхцчшъыыьэӏ</characters>
    <characters type="recommended" note="punctuation">‹›«»…</characters>
    <shaping type="required">
        <feature opentype-tag="mark">
            <bases>АЕӘОЭаеәоэ</bases>
            <marks>̄</marks>
        </feature>
    </shaping>
</language>

Note: this is just a preliminary example; I do not know whether the data is correct. A while ago I thought some indication of the shaping would be useful, as some of the required combinations might not have a precomposed codepoint. But I am not sure whether this is too much. Perhaps just noting that a feature (in this case mark) needs to exist would be sufficient. I am aware this is OpenType-specific, but there is no reason it could not include other formats in the future. Perhaps the TTX format would be better for that.
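If a schema like the sketch above were adopted, it could be consumed with the standard-library XML parser along these lines. Both the sample document and the field names are hypothetical and simply mirror the sketch:

```python
import xml.etree.ElementTree as ET

# Abbreviated sample mirroring the hypothetical <language> schema above.
SAMPLE = """<language iso-639-2="?" name="Hunzib" script="Cyrl" status="todo" opentype-tag="?">
    <characters type="required">АБВ</characters>
    <characters type="recommended" note="punctuation">«»…</characters>
</language>"""

def load_language(xml_text):
    """Read the hypothetical <language> schema into a plain dict,
    keyed by the type attribute of each <characters> element."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "script": root.get("script"),
        "characters": {
            el.get("type"): set(el.text)
            for el in root.findall("characters")
        },
    }
```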

@ultrasquid I found at least one mistake in Czech, so I would be careful about this list.

Useful resources (reliability varies):

  • Unicode CLDR (as @brawer just pointed out), though I hear the data is not in very good shape
  • WebINK Character Sets
  • EasySpeak by Typekit
  • Latin Plus by Underware: http://www.underware.nl/latin_plus/
  • Data on Languages from the Institute of the Estonian Language: http://www.eki.ee/letter
  • at Rosetta we also have a (currently somewhat random) collection of language definitions we would be happy to share or develop further

@MrBrezina

cc: @moyogo

@khaledhosny
Contributor

CLDR is not terribly reliable, at least not for Arabic. Its list of Arabic characters lacks important characters, while rarely used ones are present. See for example w3c/alreq#49.

@brawer

brawer commented Jun 8, 2016

CLDR is not terribly reliable

Putting on my Unicode hat for a sec: Please, please, please report bugs to CLDR so we can fix them.

@moyogo

moyogo commented Jun 8, 2016

CLDR should be the place for information on characters used by locales. A lot of checks can be derived from the characters and character sequences in the exemplars. But in many cases that is not sufficient.

There’s actually more information that font producers would want to be able to refer to when testing the coverage of their fonts. Glyph shape or position variation information is out of the scope of Unicode and the CLDR, yet it is a crucial part of proper locale support. Having a character doesn’t mean a font supports the languages using that character. At the same time some of these requirements are style specific and may not apply to every style. But I digress...

In any case, it might be useful to make a fork of the CLDR character exemplar data, expand and modify it with references and push the fixes upstream.

@davelab6
Member Author

davelab6 commented Jun 8, 2016 via email

@alexeiva

@davelab6 The Cyrillic comparison is something I have developed locally as a fork of Huerta Tipo's projects. Sorry, it still isn't publicly available, as I am extending it and fixing tech issues.

@MrBrezina

MrBrezina commented Jun 22, 2016

With regard to @davelab6's suggestions and @moyogo's comments (sorry if I am stating the obvious here): absolutely agreed that there is more to language support than a list of codepoints. However, part of it has to stay in the domain of type design (appropriating shapes) and type use (using these shapes) for the time being. We do not have tools and methodologies to distinguish the essential from the non-essential in the shapes (think structure vs. style). And if we cannot do that, we cannot say whether a shape complies with expectations or not. Even if we could, it would depend on more variables than just style; it also depends on whom you talk to (e.g. the Polish kreska or Bulgarian Cyrillic discussions). Moreover, preferences keep changing, and any kind of rule gets broken in amazing ways in specific contexts. So there is no way we can tackle language support completely at the moment, I think.

To digress even more, taking Central European languages as an example: there are too many (even award-winning) typefaces which include the right codepoints, even readable shapes you could say, but so badly executed that the great majority of professional Czech designers would be really disappointed if they had to use them.

So what I think we are looking for here is an automated way to diagnose fonts for language support potential based on Unicode codepoints. Nothing more. It is important to be aware of the limits. The question is how do we go about that and where do we draw the line. Personally, I think including some indication of required features is a good idea (so users get a red flag and can go: “Aha, I need something else to be there. I need to research a bit.”), also perhaps some notes. Maybe just the notes. I am not sure if describing the features is all that useful anymore. It adds too much complexity.

See, what I do not know is how to tackle things like accent positions (those not codified in Unicode in precomposed form), e.g. for some ways of writing Yoruba, or conjuncts for Indic languages. Do we just say that particular features need to be present and leave it up to the user to verify whether the support is actually there?
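The "diagnose for potential, red-flag required features" approach discussed above might be sketched like this. Everything here is hypothetical: the language record mirrors the XML sketch earlier in the thread, and in practice the font's codepoints and feature tags would be read from its cmap and GSUB/GPOS tables (e.g. with fontTools):

```python
def diagnose(language, font_codepoints, font_feature_tags):
    """Report language-support *potential* only: codepoint coverage plus a
    red flag for any required shaping feature the font does not expose.
    Whether a present feature actually behaves correctly is left to the user."""
    missing = {c for c in language["required"] if ord(c) not in font_codepoints}
    red_flags = [tag for tag in language.get("required_features", ())
                 if tag not in font_feature_tags]
    return {
        "codepoints_ok": not missing,     # necessary, never sufficient
        "missing_characters": missing,
        "feature_red_flags": red_flags,   # "I need something else to be there"
    }
```

A present `mark` feature still only signals potential; an absent one is a hard red flag the user can research further, exactly in the spirit of the note above.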

@davelab6
Member Author

@graphicore here's the list of languages I'm most interested in:

Afrikaans
Albanian
Arabic
Azerbaijani
Bulgarian
Catalan
Croatian
Czech
Danish
Dutch
Estonian
Filipino
Finnish
French
German
Greek
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Italian
Kazakh
Kyrgyz
Latvian
Lithuanian
Macedonian
Malay
Marathi
Mongolian
Nepali
Norwegian (Bokmål)
Persian
Polish
Portuguese
Portuguese (European)
Romanian
Russian
Serbian
Serbian (Latin)
Slovak
Slovenian
Spanish
Spanish (Latin America)
Swahili
Swedish
Thai
Turkish
Ukrainian
Urdu
Uzbek
Vietnamese

@simoncozens
Contributor

How does this relate to https://github.com/rosettatype/langs-db? Would it be better to "bridge" to Rosetta's YAML file and auto-instantiate charset objects from that?

@behdad

behdad commented Sep 10, 2020

cc @matthiasclasen

@behdad

behdad commented Sep 10, 2020

Hmm. Does Rosetta's repo really not have the issue tracker enabled? @MrBrezina

At any rate, whichever is deemed more canonical, I'd love to merge it with fontconfig's database and make fontconfig generate from it...

@MrBrezina

@behdad I have activated it now. :) We did not consider it quite ready.

btw. we renamed it to Hyperglot today (Langs DB was too general), and @kontur refactored the tool and tests for the new structure of the database. We plan to add more languages in the next few weeks.
