Name translation improvements #86
First of all, in fairness, I appreciate that this is a faithful port of the OpenMapTiles logic. However, the logic, which originally came from the mapnik-german-l10n package, makes some questionable decisions from an internationalization standpoint. It’s unclear to me whether the The Line 58 in 0d727c4
This presents a problem for Vietnamese, which is distributed across all four blocks. Effectively, any Vietnamese word that has a non-level tone gets devowelized, which looks broken to the user:
Granted, “Vietnamese” is separate from “Roman” in terms of the scripts that TrueType fonts declare support for, but most modern fonts do support Vietnamese, including the OpenMapTiles fonts. A more robust filter would be the set of characters whose glyphs are included in these fonts. Alternatively, if such tight coupling to the OpenMapTiles fonts is not desired, If a machine-readable ASCII name is desired, instead of simply removing anything that isn’t a letter, it would be better to perform diacritic folding using the
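As a rough illustration of diacritic folding, here is a minimal sketch that assumes the JDK's java.text.Normalizer is an acceptable stand-in for the folding step. The class and method names (DiacriticFolding, fold) are hypothetical and are not part of planetiler.

```java
import java.text.Normalizer;

final class DiacriticFolding {
  // Hypothetical helper: decompose each character (NFD), then drop the
  // combining marks so only the base letters remain.
  static String fold(String name) {
    String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}+", "");
  }

  public static void main(String[] args) {
    // Hanoi written with Vietnamese tone and vowel marks (escaped to keep the source ASCII).
    System.out.println(fold("H\u00e0 N\u1ed9i")); // prints "Ha Noi"
  }
}
```

Letters that have no canonical decomposition, such as Đ/đ, pass through this sketch unchanged, so a real implementation would probably need an extra mapping for them.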
You stand true to this @1ec5 :)
When the name contains no “Latin” characters, the code transliterates the text into Latin text using ICU’s planetiler/planetiler-core/src/main/java/com/onthegomap/planetiler/util/Translations.java Lines 122 to 133 in 0d727c4
This transliterator can be rough coming from some languages, because different schemes are often used depending on the source language, region, and use case. For example:
ICU has somewhat more reliable transliterators that require you to know the source language. As in #14 (comment), I think knowing the country the feature is in would be a good first step to improving the quality of these transliterations. At a minimum, detecting the source script would allow you to use script-specific transliterators and apply script-specific adjustments, like removing the diacritics from pinyin.
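Detecting the source script could look roughly like the following sketch, which relies only on the JDK's Character.UnicodeScript. The class and method names (ScriptDetection, dominantScript) are hypothetical, and choosing a transliterator from the result is left out because it depends on ICU.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptDetection {
  // Hypothetical helper: returns the script used by the most letters in the
  // text, or null when no letter belongs to a specific script.
  static UnicodeScript dominantScript(String text) {
    Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
    text.codePoints()
        .filter(Character::isLetter)
        .mapToObj(UnicodeScript::of)
        // COMMON and INHERITED say nothing about the source language.
        .filter(s -> s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED)
        .forEach(s -> counts.merge(s, 1, Integer::sum));

    UnicodeScript best = null;
    int bestCount = 0;
    for (Map.Entry<UnicodeScript, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }
}
```

A caller could then branch on the result, for example applying pinyin-specific cleanup only when the dominant script is HAN. Note that Han characters alone do not distinguish Chinese from Japanese, which is why knowing the country would still help.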
Thanks for the incredible detail and pointers @1ec5! The usual use-case I see for On translation/transliteration, I think the preferred solution is for OSM elements to have a Latin translation (name:en, name:de, int_name, etc.); in that case, we won't attempt to transliterate at all. It's only when no Latin variant of the name exists that we need to infer it somehow.
Thanks, that makes sense. The GL style specification even supports rich text labels, so the second line of these bilingual labels can be formatted differently. Unfortunately, explicit translations are much less likely (and not universally accepted in OSM) for more obscure but common features like street names, so there’s still a high potential for
Courtesy of @jleedev in OSMUS Slack, Wikidata is making an amusing cameo appearance in some places: The ways in question are exquisitely typeset with en dashes, which the Latin detection code apparently regards as non-Latin, so it falls back to whichever While it may seem contrived to put proper typographical characters on street names, the lack of support for them combined with overeager use of name subkeys can affect other features as well. For example, this POI in Germany is named with German quotation marks, lacks Fortunately, the same regular expression syntax in #86 (comment) can avoid these mishaps with some additional character classes: ^[\P{IsLetter}[\p{IsLetter}&&\p{IsLatin}]]+$ This matches anything that isn’t a “letter” in the Unicode character database (i.e., not a letter, ideograph, or modifier letter), as well as anything that is a letter in the Latin script.
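To show how that expression behaves, here is a small sketch using java.util.regex. The pattern is the one quoted above; the class and method names (LatinDetection, isLatinOrNonLetter) are hypothetical and do not reflect how planetiler wires up its own check.

```java
import java.util.regex.Pattern;

final class LatinDetection {
  // Pattern quoted from the comment above: every character must be either a
  // non-letter or a letter in the Latin script.
  static final Pattern LATIN_OR_NON_LETTER =
      Pattern.compile("^[\\P{IsLetter}[\\p{IsLetter}&&\\p{IsLatin}]]+$");

  static boolean isLatinOrNonLetter(String name) {
    return LATIN_OR_NON_LETTER.matcher(name).matches();
  }

  public static void main(String[] args) {
    // Latin letters joined by an en dash (U+2013), escaped to keep the source ASCII.
    System.out.println(isLatinOrNonLetter("A\u2013B Street")); // prints "true"
    // Two Han characters.
    System.out.println(isLatinOrNonLetter("\u6771\u4eac")); // prints "false"
  }
}
```

With this check, en dashes, quotation marks, and digits no longer cause a name to be treated as non-Latin, while Vietnamese letters still count as Latin because they belong to the Latin script.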
@1ec5 seems like there are a few issues going on here. The most urgent one sounds like the names that aren't meant to be names (Wikidata QIDs) showing up as road labels. Do you think it would be a reasonable fix for that to just limit the
Yes, sorry, I thought you had intended this to be an omnibus ticket about localization issues. It would be cleaner to track them in separate issues in the future. The most robust fix would be to limit the subkeys to those that would be valid BCP 47 codes, like xx and xx-YY and xx-YY-ZZZZ. But limiting it to the languages in the profile should work too. I think the revised non-Latin detection code would still be worth pursuing regardless. With just the smaller fix you’re suggesting, there will be cases where an alternative language’s name is arbitrarily chosen just because of a “non-Latin” character.
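A subkey filter along those lines might look like the sketch below. The pattern is only a loose approximation of the xx, xx-YY, and xx-YY-ZZZZ shapes and is not a full BCP 47 validator. The names (NameSubkeys, isLanguageKey) are hypothetical, and name:etymology:wikidata is used purely as an illustrative non-language key.

```java
import java.util.regex.Pattern;

final class NameSubkeys {
  // Assumption: a language subkey is a 2-3 letter lowercase language code,
  // optionally followed by hyphen-separated subtags of 2-8 letters or digits.
  static final Pattern LANGUAGE_KEY =
      Pattern.compile("name:[a-z]{2,3}(-[A-Za-z0-9]{2,8})*");

  static boolean isLanguageKey(String key) {
    return LANGUAGE_KEY.matcher(key).matches();
  }

  public static void main(String[] args) {
    System.out.println(isLanguageKey("name:zh-Hant")); // prints "true"
    System.out.println(isLanguageKey("name:etymology:wikidata")); // prints "false"
  }
}
```

A plain length check on the language code is used here because a generic language-tag parser would accept any alphabetic subtag of up to eight letters, which would let a word like "wikidata" through.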
No worries! I did intend this to be an omnibus ticket, just trying to see if I can extract isolated issues from it to work on. I'll give those a shot in #146
For Japanese names, the situation is even worse since the Pinyin romanization is not just controversial but incomprehensible. The following should have been transliterated as Mukogasaki Koen (or Mukogasaki Park, 向ヶ崎公園): Romanizing Japanese is nontrivial though, and ICU doesn't support it AFAICT. We'd need to use a morphological analyzer, like kuromoji (which is far from perfect and would give "kōgasakikōen" in this example).
For many languages, there are keys such as Wikidata also has a variety of properties to indicate the transliteration of a place name, though the long-term approach would be to look at the transliterations stored in lexicographical data.
Hi, just finding this issue. For place names in Japanese I've found this Python converter to be really great. https://github.com/polm/cutlet
LanguageUtils is a straight port of the OpenMapTiles logic, but there are a few issues with it. Please add a comment with any suggestions for improving the logic to assign element names, picking latin/nonlatin names, transliterating, etc. along with some example test cases to illustrate the desired behavior.