Name translation improvements #86

Open
msbarry opened this issue Feb 24, 2022 · 13 comments

Comments

@msbarry
Contributor

msbarry commented Feb 24, 2022

LanguageUtils is a straight port from openmaptiles logic, but there are a few issues with it. Please add a comment with any suggestions for improving the logic to assign element names, picking latin/nonlatin names, transliterating, etc. along with some example test cases to illustrate the desired behavior.

@1ec5

1ec5 commented Feb 24, 2022

First of all, in fairness, I appreciate that this is a faithful port of the OpenMapTiles logic. However, the logic, which originally came from the mapnik-german-l10n package, makes some questionable decisions from an internationalization standpoint.

It’s unclear to me whether the name:latin field is intended to be a sanitized string of the sort that’s used for file names and URL slugs, or whether it’s intended to be presented to the user as a somehow more legible version of name for speakers of languages written in a Latin alphabet. If the latter, the regular expression should more closely match the set of characters supported by clients (such as Mapbox GL JS or MapLibre GL JS) and the fonts that are most likely to be used with them.

The LETTER regular expression explicitly matches the Basic Latin, Latin-1 Supplement, Latin Extended-A, and Latin Extended-B blocks of Unicode, but it excludes Latin Extended Additional (among other Latin blocks):

private static final Pattern LETTER = Pattern.compile("[A-Za-zÀ-ÖØ-öø-ÿĀ-ɏ]+");

This presents a problem for Vietnamese, whose alphabet is distributed across all five of these blocks, including the excluded Latin Extended Additional. Effectively, any Vietnamese word that has a non-level tone gets devowelized, which looks broken to the user:

[Image: Hiệp Phước]
“Hiệp Phước” becomes “Hip Phc”, and “Tôn Đức Thắng” becomes “Tôn Đc Thng”.

Granted, “Vietnamese” is separate from “Roman” in terms of the scripts that TrueType fonts declare support for, but most modern fonts do support Vietnamese, including the OpenMapTiles fonts. A more robust filter would be the set of characters whose glyphs are included in these fonts.[1] Alternatively, if such tight coupling to the OpenMapTiles fonts is not desired, ^\p{IsLatin}+$ would be a simple but effective replacement for the current containsOnlyLatinCharacters() method, and \p{IsLetter} could replace LETTER.
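A minimal sketch of the difference between the two character classes, using only java.util.regex. The class name LatinLetterCheck and the helper keepMatches are hypothetical and exist only for illustration; the sketch does not reproduce how LanguageUtils actually uses LETTER, it only shows which characters each class accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LatinLetterCheck {
  // The LETTER class quoted above, written with escapes: Basic Latin letters,
  // Latin-1 Supplement letters, Latin Extended-A, and Latin Extended-B.
  static final Pattern OLD_LETTER = Pattern.compile(
      "[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u024F]+");

  // Script-based alternative: any character whose Unicode script is Latin,
  // which also covers Latin Extended Additional.
  static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{IsLatin}+");

  /** Hypothetical helper: concatenates every run of characters the pattern matches. */
  static String keepMatches(Pattern pattern, String input) {
    StringBuilder out = new StringBuilder();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
      out.append(m.group());
    }
    return out.toString();
  }
}
```

With the explicit ranges, keepMatches drops the precomposed Vietnamese vowel in “Hiệp” (leaving “Hip”), whereas the script-based class keeps the word intact.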

If a machine-readable ASCII name is desired, instead of simply removing anything that isn’t a letter, it would be better to perform diacritic folding using the Latin-ASCII ICU transform or some other diacritic folding library. Thus, “Hiệp Phước” would become “Hiep Phuoc”.
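As a rough illustration of diacritic folding, here is a sketch that uses only the JDK (java.text.Normalizer) rather than ICU's Latin-ASCII transform. It is an approximation, not an equivalent: it decomposes to NFD, strips combining marks, and special-cases Đ/đ, which have no canonical decomposition. Letters such as ß or æ are left untouched, so the output is not guaranteed to be pure ASCII. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticFolding {
  // Any Unicode combining mark (general category M).
  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

  /** Hypothetical helper: removes diacritics instead of removing whole letters. */
  static String foldDiacritics(String input) {
    if (input == null) {
      return null;
    }
    // Split precomposed letters into base letter + combining marks.
    String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
    // Drop the marks, keeping the base letters.
    String stripped = COMBINING_MARKS.matcher(decomposed).replaceAll("");
    // D with stroke does not decompose, so map it explicitly.
    return stripped.replace('\u0110', 'D').replace('\u0111', 'd');
  }
}
```

For the examples in this comment, foldDiacritics turns “Hiệp Phước” into “Hiep Phuoc” and “Tôn Đức Thắng” into “Ton Duc Thang”.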

Footnotes

  1. Tangentially, it might be useful for name to hide characters in scripts that MapLibre GL JS has a particular problem laying out even with the RTL plugin installed; see mapbox/mapbox-gl-js#4009 (Full "complex text" support: indic scripts, ligatures, kerning, etc.).

@wipfli
Contributor

wipfli commented Feb 24, 2022

> Ardent defender of diacritics everywhere

You stand true to this @1ec5 :)

@1ec5

1ec5 commented Feb 24, 2022

When the name contains no “Latin” characters, the code transliterates the text into Latin text using ICU’s Any-Latin transliterator:

private static final Transliterator TO_LATIN_TRANSLITERATOR = Transliterator.getInstance("Any-Latin");

/**
 * Attempts to translate non-latin characters to latin characters that preserve the <em>sound</em> of the word (as
 * opposed to translation which attempts to preserve meaning) using ICU4j.
 * <p>
 * NOTE: This can be expensive and transliteration is synchronized deep down in ICU4j internals which limits benefit
 * of running in multiple threads, so exhaust all other options first.
 */
public static String transliterate(String input) {
  return input == null ? null : TO_LATIN_TRANSLITERATOR.transform(input);
}

This transliterator can be rough coming from some languages, because different schemes are often used depending on the source language, region, and use case. For example:

[Image: Kiev]
Cyrillic text is transliterated to Latin according to the ISO 9 standard, which is less biased toward a particular language but still differs from the more common transliteration schemes used in each language or country, as seen here in these Ukrainian park names.

[Image: Shei-Pa]
Chinese text appears to be transliterated to Latin using Hanyu Pinyin. A map intended for lay readers would remove diacritics and spaces between syllables of compound words from these pinyin transliterations. Additionally, in Taiwan, the use of Hanyu Pinyin versus Tongyong Pinyin is a partisan and regional matter. (These names all end in words like “Ecological Protection Area” that would ideally be translated rather than transliterated.)

ICU has somewhat more reliable transliterators that require you to know the source language. As in #14 (comment), I think knowing the country the feature is in would be a good first step to improving the quality of these transliterations. At a minimum, detecting the source script would allow you to use script-specific transliterators and apply script-specific adjustments, like removing the diacritics from pinyin.
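One way to sketch the script-detection step with the JDK alone: count the Unicode script of each code point and map the dominant script to an ICU transform ID. The class, the method names, and the particular mapping are assumptions for illustration; the ID strings are meant to be passed to Transliterator.getInstance, and chaining “Latin-ASCII” after “Han-Latin” is one possible way to drop pinyin tone marks. Detecting the script does not identify the language, so a Japanese name written mostly in kanji would still be classified as Han here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptAwareTransliteration {
  /** Returns the most frequent script in the name, ignoring punctuation, digits and marks. */
  static UnicodeScript dominantScript(String name) {
    Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
    name.codePoints().forEach(cp -> {
      UnicodeScript script = UnicodeScript.of(cp);
      if (script != UnicodeScript.COMMON
          && script != UnicodeScript.INHERITED
          && script != UnicodeScript.UNKNOWN) {
        counts.merge(script, 1, Integer::sum);
      }
    });
    UnicodeScript best = UnicodeScript.COMMON;
    int bestCount = 0;
    for (Map.Entry<UnicodeScript, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  /** Illustrative mapping from script to an ICU transform ID; null means no transliteration needed. */
  static String transliteratorIdFor(UnicodeScript script) {
    switch (script) {
      case LATIN:
        return null;
      case CYRILLIC:
        return "Cyrillic-Latin";
      case HAN:
        // Assumption: fold the pinyin tone marks away after transliterating.
        return "Han-Latin; Latin-ASCII";
      case HANGUL:
        return "Hangul-Latin";
      case GREEK:
        return "Greek-Latin";
      default:
        // Fall back to the generic transliterator used today.
        return "Any-Latin";
    }
  }
}
```

Knowing the country, as suggested above, could refine this further, for example by choosing a different scheme for Cyrillic text in Ukraine than elsewhere.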

@msbarry
Contributor Author

msbarry commented Feb 25, 2022

Thanks for the incredible detail and pointers @1ec5! The usual use-case I see for name:latin and name:nonlatin is to provide dual labels when the local name is nonlatin, for example check out the style demos on https://stadiamaps.com/, like:

[Image]

On translation/transliteration, I think the preferred solution is for OSM elements to have a latin translation (name:en, name:de, int_name, etc...) - in that case, we won't attempt to transliterate at all. It's just in the case where no latin variation of the name exists that we need to infer it somehow.

@1ec5

1ec5 commented Feb 25, 2022

Thanks, that makes sense. The GL style specification even supports rich text labels, so the second line of these bilingual labels can be formatted differently. Unfortunately, explicit translations are much less likely (and not universally accepted in OSM) for more obscure but common features like street names, so there’s still a high potential for name:latin to show up even if name:en, int_name, etc. are preferred over it.

@1ec5

1ec5 commented Feb 25, 2022

Courtesy of @jleedev in OSMUS Slack, Wikidata is making an amusing cameo appearance in some places:

[Image: QIDs]

The ways in question are exquisitely typeset with en dashes, which the Latin detection code apparently regards as non-Latin, so it falls back to whichever name:* it can find that contains only Latin characters. It just happens that these ways are also tagged with name:etymology:wikidata, which is guaranteed to be set to an ASCII-only value. Some name:* subkeys don’t identify a language but instead refine the name somehow. Other common examples include name:signed, name:prefix, and name:pronunciation (which is “non-Latin” IPA anyways).

While it may seem contrived to put proper typographical characters on street names, the lack of support for them combined with overeager use of name subkeys can affect other features as well. For example, this POI in Germany is named with German quotation marks, lacks name:en or name:de, and could conceivably be tagged with name:etymology:wikidata=Q217964.

Fortunately, the same regular expression syntax in #86 (comment) can avoid these mishaps with some additional character classes: ^[\p{IsLatin}\p{IsPunctuation}]+$. A lot can be done by combining character classes. If the goal is merely to filter out linguistic content that’s in a different writing system, filtering on the general category and script should do the trick:

^[\P{IsLetter}[\p{IsLetter}&&\p{IsLatin}]]+$

This matches anything that isn’t a “letter” in the Unicode character database (i.e., not a letter, ideograph, or modifier letter), as well as anything that is a letter in the Latin script.
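A small sketch of how that expression behaves in Java, where the backslashes need to be doubled inside a string literal. The class and method names are hypothetical; the pattern is the one given above.

```java
import java.util.regex.Pattern;

final class LatinDetection {
  // Anything that is not a letter, plus any letter in the Latin script.
  static final Pattern NO_NON_LATIN_LETTERS =
      Pattern.compile("^[\\P{IsLetter}[\\p{IsLetter}&&\\p{IsLatin}]]+$");

  /** True if the name is non-empty and every letter in it belongs to the Latin script. */
  static boolean containsOnlyLatinLetters(String name) {
    return name != null && NO_NON_LATIN_LETTERS.matcher(name).matches();
  }
}
```

Under this check, names with en dashes, German quotation marks, spaces, or digits still count as Latin, while a name containing any Cyrillic or Han letter does not.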

msbarry added the "bug" (Something isn't working) label Mar 8, 2022
@msbarry
Contributor Author

msbarry commented Mar 23, 2022

@1ec5 seems like there are a few issues going on here. The most urgent one sounds like the names that aren't meant to be names (wikidata QIDs) showing up as road labels. Do you think it would be a reasonable fix for that to just limit the name:<language> tags that get checked to the languages the profile is using? For example in openmaptiles:

https://github.com/openmaptiles/openmaptiles/blob/8693822d506076d1cbf0d777d40d3a0a12986ce6/openmaptiles.yaml#L30-L99

@1ec5

1ec5 commented Mar 23, 2022

Yes, sorry, I thought you had intended this to be an omnibus ticket about localization issues. It would be cleaner to track them in separate issues in the future.

The most robust fix would be to limit the subkeys to those that would be valid BCP 47 codes, like xx and xx-YY and xx-YY-ZZZZ. But limiting it to the languages in the profile should work too.
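A rough sketch of such a subkey filter. The pattern below is deliberately narrower than full BCP 47: it accepts only a two- or three-letter language, an optional four-letter script, and an optional two-letter region, which is an assumption about the shapes that matter here. Full BCP 47 syntax also permits five- to eight-letter language subtags, so a purely syntactic check would still let subkeys like signed or prefix through. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class NameSubkeys {
  // name:xx, name:xxx, name:xx-Ssss, name:xx-YY, name:xx-Ssss-YY
  static final Pattern LANGUAGE_NAME_KEY =
      Pattern.compile("name:[a-z]{2,3}(-[A-Z][a-z]{3})?(-[A-Z]{2})?");

  /** True if the OSM key looks like name:<language tag> rather than a refinement such as name:etymology:wikidata. */
  static boolean isLanguageNameKey(String key) {
    return key != null && LANGUAGE_NAME_KEY.matcher(key).matches();
  }
}
```

Limiting the check to the profile's configured languages instead would amount to a set lookup on the part after “name:”.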

I think the revised non-Latin detection code would still be worth pursuing regardless. With just the smaller fix you’re suggesting, there will be cases where an alternative language’s name is arbitrarily chosen just because of a “non-Latin” character.

@msbarry
Contributor Author

msbarry commented Mar 24, 2022

No worries! I did intend this to be an omnibus ticket, just trying to see if I can extract isolated issues from it to work on. I'll give those a shot in #146

@gebner

gebner commented Jun 12, 2022

> Chinese text appears to be transliterated to Latin using Hanyu Pinyin. A map intended for lay readers would remove diacritics and spaces between syllables of compound words from these pinyin transliterations. Additionally, in Taiwan, the use of Hanyu Pinyin versus Tongyong Pinyin is a partisan and regional matter. (These names all end in words like “Ecological Protection Area” that would ideally be translated rather than transliterated.)

For Japanese names, the situation is even worse since the Pinyin romanization is not just controversial but incomprehensible. The following should have been transliterated as Mukogasaki Koen (or Mukogasaki Park, 向ヶ崎公園):

[Image: mukogasakikouen]

Romanizing Japanese is nontrivial though, and ICU doesn't support it AFAICT. We'd need to use a morphological analyzer, like kuromoji (which is far from perfect and would give "kōgasakikōen" in this example).

@jleedev

jleedev commented Jun 12, 2022

There are also useless values in name_de and name_en for some reason.

[Image: Screenshot_20220612-095135]

@1ec5

1ec5 commented Jul 2, 2022

> On translation/transliteration, I think the preferred solution is for OSM elements to have a latin translation (name:en, name:de, int_name, etc...) - in that case, we won't attempt to transliterate at all. It's just in the case where no latin variation of the name exists that we need to infer it somehow.

For many languages, there are keys such as name:ko-Latn and name:sr-Latn that allow mappers to choose the transliteration system most appropriate to a given language. Ideally the tiles would include those language-qualified property names, because name:latin is just as ambiguous as name.
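To illustrate, a sketch that collects the mapper-supplied romanizations from an element's tags, keyed by language, so they could be preferred over machine transliteration or emitted as language-qualified properties. The class name, method name, and the restriction to a two- or three-letter language code followed by -Latn are assumptions for this example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LatnNames {
  // Keys such as name:ko-Latn or name:sr-Latn; group 1 is the language code.
  private static final Pattern LATN_KEY = Pattern.compile("name:([a-z]{2,3})-Latn");

  /** Returns language code -> romanized name for every name:<lang>-Latn tag present. */
  static Map<String, String> romanizedNames(Map<String, String> tags) {
    Map<String, String> result = new TreeMap<>();
    for (Map.Entry<String, String> tag : tags.entrySet()) {
      Matcher m = LATN_KEY.matcher(tag.getKey());
      if (m.matches()) {
        result.put(m.group(1), tag.getValue());
      }
    }
    return result;
  }
}
```

An element tagged with name:ko-Latn would yield a single entry under “ko”, while refinement subkeys such as name:etymology:wikidata are ignored.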

Wikidata also has a variety of properties to indicate the transliteration of a place name, though the long-term approach would be to look at the transliterations stored in lexicographical data.

@j9d3it

j9d3it commented Dec 2, 2024

Hi, just finding this issue. For place names in Japanese I've found this Python converter to be really great. https://github.com/polm/cutlet
