Develop one canonical, vetted, reviewed, set of character lists per language #89
Comments
On http://www.cheapprofonts.com/Languages.php there is a chart that may be On Mon, Jun 6, 2016 at 12:28 PM, Dave Crossland notifications@github.com
Jason Pagura |
cc @brawer |
cc @twardoch |
cc @MrBrezina |
@ultrasquid, did you know that Unicode CLDR collects which characters are used in what language? For each language, it has four sets of “exemplar characters”. The main set is shown here; the auxiliary set here. Admittedly, the charts could be made a little more readable, but CLDR is a nice central place for collecting this information. If anything is missing or wrong, just file a CLDR ticket. From CLDR, the exemplar character information flows into libraries such as ICU, which is built into many systems. But you can also take the data directly from the source XML files. Search for “exemplar”, for example in the French data. |
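As a rough illustration of taking the exemplar data directly from the XML, here is a minimal Python sketch. The `SAMPLE_LDML` string is a made-up, abbreviated excerpt in the general shape of a CLDR locale file (not real French data), and `parse_simple_set` is a hypothetical helper that only understands space-separated items, `a-z` style ranges and `{sequences}`. Real CLDR files use the fuller UnicodeSet syntax (escapes, unspaced items), so a proper UnicodeSet parser such as the one in ICU would be the safer choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated excerpt shaped like a CLDR locale file (not real data).
SAMPLE_LDML = """<ldml>
  <characters>
    <exemplarCharacters>[a à â b c ç d e é è ê {ch} x-z]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[á ñ]</exemplarCharacters>
  </characters>
</ldml>"""


def parse_simple_set(text):
    """Parse a simplified subset of UnicodeSet syntax (assumption: items are space-separated)."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = set()
    for token in body.split():
        if token.startswith("{") and token.endswith("}"):
            # A multi-character sequence such as {ch}
            items.add(token[1:-1])
        elif len(token) == 3 and token[1] == "-":
            # A simple range such as x-z
            for cp in range(ord(token[0]), ord(token[2]) + 1):
                items.add(chr(cp))
        else:
            items.add(token)
    return items


def exemplar_sets(ldml_text):
    """Return a dict mapping exemplar type ('main', 'auxiliary', ...) to a set of strings."""
    root = ET.fromstring(ldml_text)
    result = {}
    for node in root.iter("exemplarCharacters"):
        kind = node.get("type", "main")  # the main set carries no type attribute
        result[kind] = parse_simple_set(node.text or "")
    return result


sets = exemplar_sets(SAMPLE_LDML)
print(sorted(sets["main"]))
print(sorted(sets["auxiliary"]))
```

The result is a plain dictionary of sets, which keeps the main/auxiliary distinction available to whatever check is run afterwards.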
My two cents. What needs to be established first is a clear definition of the set of characters used by a certain language. I know at least two approaches, and I am sure an educated linguist would come up with more, and perhaps more precise ones:
I think that the difference between (2) and (1) is what CLDR calls auxiliary, which seems like a good approach. In terms of what is useful to save for a single language, I thought of this.
Note: it is just a preliminary example. I do not know whether the data is correct. A while ago, I thought some indication of the shaping would be useful as some of the combinations are required, but might not have a codepoint. But I am not sure if this is not too much. Perhaps just noting that a feature (in this case @ultrasquid I found at least one mistake in Czech, so I would be careful about this list. Useful resources (reliability varies):
|
cc: @moyogo |
CLDR is not terribly reliable, at least not for Arabic. Its list of Arabic characters is lacking important characters, while rarely used characters are present. See for example w3c/alreq#49. |
Putting on my Unicode hat for a sec: Please, please, please report bugs to CLDR so we can fix them. |
CLDR should be the place for information on characters used by locales. A lot of checks can be derived from the characters and character sequences in the exemplars. But in many cases that is not sufficient. There’s actually more information that font producers would want to be able to refer to when testing the coverage of their fonts. Glyph shape or position variation information is out of the scope of Unicode and the CLDR, yet it is a crucial part of proper locale support. Having a character doesn’t mean a font supports the languages using that character. At the same time some of these requirements are style specific and may not apply to every style. But I digress... In any case, it might be useful to make a fork of the CLDR character exemplar data, expand and modify it with references and push the fixes upstream. |
Huerta Tipo have released comparison sites for Devanagari, Cyrillic and
Greek, I think this descriptivist approach might be more helpful than a
prescriptivist guide :)
|
@davelab6 The Cyrillic comparison is something I have developed as a fork from Huerta Tipo's projects locally. Sorry, it still isn't publicly available, as I am extending it, and fixing tech issues. |
With regard to @davelab6's suggestions and @moyogo's comments (sorry if I am stating the obvious here): absolutely agreed that there is more to language support than a list of codepoints. However, part of it has to stay in the domain of type design (appropriating shapes) and type use (using these shapes) for the time being. We do not have tools and methodologies to distinguish the essential from the non-essential in shapes (think structure vs. style). And if we cannot do that, we cannot say that one shape complies with expectations and another does not. And even if we could, it would depend on more variables than just style. It also depends on whom you talk to (e.g. the Polish kreska or Bulgarian Cyrillic discussions). Moreover, preferences keep changing, and any kind of rule gets broken in amazing ways in specific contexts. So there is no way we can tackle language support completely at the moment, I think.
To digress even more, take Central European languages as an example. There are too many (even awarded) typefaces which include the right codepoints, even readable shapes you could say, but which are so badly executed that a great majority of professional Czech designers would be really disappointed if they had to use them.
So what I think we are looking for here is an automated way to diagnose fonts for language support potential based on Unicode codepoints. Nothing more. It is important to be aware of the limits. The question is how we go about that and where we draw the line. Personally, I think including some indication of required features is a good idea (so users get a red flag and can go: “Aha, I need something else to be there. I need to research a bit.”), and perhaps also some notes. Maybe just the notes. I am no longer sure that describing the features is all that useful. It adds too much complexity.
See, what I do not know is how to tackle things like accent positions (those which are not codified in Unicode in precomposed form), e.g. for ways of writing Yoruba, or conjuncts for Indic languages. Do we just say that there need to be particular features and leave it up to the user to clarify whether the support is there? |
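To make the "diagnose support potential from codepoints, plus notes as a red flag" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `diagnose` function, the shape of its result, and the toy character sets (which are not vetted data for any language). The font is represented only by a set of integer codepoints; in practice that set could come from a font's cmap table, for example via fontTools.

```python
def diagnose(font_codepoints, main, auxiliary=frozenset(), notes=()):
    """Report which exemplar items a font cannot encode, based on codepoints alone.

    font_codepoints: set of int codepoints the font maps (e.g. from its cmap).
    main, auxiliary: sets of strings; an item may be a multi-character sequence.
    notes: free-text caveats that a codepoint check cannot verify.
    """
    def missing(items):
        # An item is missing if any of its characters has no codepoint in the font.
        return [item for item in sorted(items)
                if any(ord(ch) not in font_codepoints for ch in item)]

    missing_main = missing(main)
    return {
        "supported": not missing_main,  # "potential" only: glyph shapes are not checked
        "missing_main": missing_main,
        "missing_auxiliary": missing(auxiliary),
        "notes": list(notes),
    }


# Hypothetical toy data, not a real language definition.
toy_main = {"a", "č", "ch"}
toy_auxiliary = {"q"}
toy_font = {ord("a"), ord("c"), ord("h"), ord("q")}

report = diagnose(
    toy_font,
    toy_main,
    toy_auxiliary,
    notes=["Accent placement and required features must be verified by hand."],
)
print(report)
```

Passing the notes through unchanged is deliberate: the check only answers the codepoint question, and the notes are there to flag what still needs human research.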
@graphicore here's the list of languages I'm most interested in: Afrikaans |
How does this relate to https://github.com/rosettatype/langs-db? Would it be better to "bridge" to Rosetta's YAML file and auto-instantiate charset objects from that? |
Hmm. Does Rosetta's repository really not have the issue tracker enabled? @MrBrezina At any rate, whichever is deemed more canonical, I'd love to merge it with fontconfig's database and make fontconfig generate from it... |
Quoting @behdad