Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy locationSet during build #7226

Merged
merged 15 commits into from
Oct 27, 2022
Merged

Tidy locationSet during build #7226

merged 15 commits into from
Oct 27, 2022

Conversation

peternewman
Copy link
Collaborator

We should use gb consistently, firstly because its the correct/actual location, secondly to make it easier and better compatibility for external tools and finally because it makes searches on nsi.guide more consistent too.

We should use gb consistently, firstly because its the correct/actual location, secondly to make it easier and better compatibility for external tools and finally because it makes searches on nsi.guide more consistent too.
@bhousel
Copy link
Member

bhousel commented Oct 19, 2022

I agree it's probably ok to just have the build script replace 'uk' -> 'gb', but I kind of don't like the idea of adding a config file for this.

@peternewman
Copy link
Collaborator Author

I agree it's probably ok to just have the build script replace 'uk' -> 'gb', but I kind of don't like the idea of adding a config file for this.

Okay, I was just going off the others. There are also theoretically some other mappings that could happen, although I suspect this is by far the most common...

@bhousel
Copy link
Member

bhousel commented Oct 19, 2022

Yeah I think if it's just this one thing, I'd rather have an if in the build script than introduce a bunch more files and config.

Also, from last year I got the sense that people might also prefer to use the more specific QIDs for the UK like on this issue:
rapideditor/country-coder#35 (comment)

@peternewman
Copy link
Collaborator Author

Also, from last year I got the sense that people might also prefer to use the more specific QIDs for the UK like on this issue: ideditor/country-coder#35 (comment)

IMHO that would be a step backwards, especially if it's only done for the UK. E.g. https://github.com/westnordost/osmfeatures currently consumes this data (via the iD presets too, but it doesn't handle the UK nations currently, so anything tagged specifically in England for example doesn't currently show up. It strikes me that heading towards more niche (rather than internationally standardised via ISO) designations isn't going to help matters. Whereas pretty much any language can find out which ISO country code you're in without too much trouble.

@LaoshuBaby
Copy link
Collaborator

I'm ashamed that I've actually never been able to figure out the nuances that exist between gb and uk. At first I think this PR meaning to replace uk in current json file, but looks like it add a new rule to build script, and that's the change attract me most. If possiable I wish to add replace rule zh->cn, ja->jp, ko->kr and vi->vn because I've seen those kind of wrong use of language code(especially in Asian language) many times.

@peternewman
Copy link
Collaborator Author

peternewman commented Oct 22, 2022

I'm ashamed that I've actually never been able to figure out the nuances that exist between gb and uk.

No worries, I hadn't really looked into it much myself in the past despite living here! This diagram shows it quite nicely:
https://commons.wikimedia.org/wiki/File:British_Isles_Venn_Diagram-en.png

So pendantically, gb<uk, however given only one of them is defined in ISO 3166-1:
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#GB

And the other is just reserved:
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#UK

The confusion is our inconsistency between our ISO 3166-1 alpha-2 code and country code TLD, which I think is fairly unique.

At first I think this PR meaning to replace uk in current json file,

Yeah I figured that while changing them once would resolve the short term issue, inevitably more would sneak in, hence the scripting...

but looks like it add a new rule to build script, and that's the change attract me most. If possiable I wish to add replace rule zh->cn, ja->jp, ko->kr and vi->vn because I've seen those kind of wrong use of language code(especially in Asian language) many times.

I assume this is people using a language code rather than a country code?

I think your list is 50-50 currently, from what I can see, these are fine as there's no current reservation:
zh->cn, ko->kr

But this is more problematic:
ja->jp, - They could mean Jamaica too, which has reserved the code ja.

And this is just blocked:
vi->vn - Is actually Virgin Islands

So is @bhousel happy for my config file to be reinstated if it's now more complicated...

We could also offer reverse mapping for some misused country codes back to language codes:
cn->zh
jp->ja
vn->vi

Apart from this one, where both exist:
kr->ko

It might make sense to push some of this into config too, given it's the same pairings:

// If the name can only be reasonably read in one country,
// assign `locationSet`, and localize tags like `name:xx`
// https://www.regular-expressions.info/unicode.html
if (/[\u0590-\u05FF]/.test(name)) { // Hebrew
// note: old ISO 639-1 lang code for Hebrew was `iw`, now `he`
if (!item.locationSet) item.locationSet = { include: ['iw'] };
setLanguageTags(tags, 'he');
} else if (/[\u0E00-\u0E7F]/.test(name)) { // Thai
if (!item.locationSet) item.locationSet = { include: ['th'] };
setLanguageTags(tags, 'th');
} else if (/[\u1000-\u109F]/.test(name)) { // Myanmar
if (!item.locationSet) item.locationSet = { include: ['mm'] };
setLanguageTags(tags, 'my');
} else if (/[\u1100-\u11FF]/.test(name)) { // Hangul
if (!item.locationSet) item.locationSet = { include: ['kr'] };
setLanguageTags(tags, 'ko');
} else if (/[\u1700-\u171F]/.test(name)) { // Tagalog
if (!item.locationSet) item.locationSet = { include: ['ph'] };
setLanguageTags(tags, 'tl');
} else if (/[\u3040-\u30FF]/.test(name)) { // Hirgana or Katakana
if (!item.locationSet) item.locationSet = { include: ['jp'] };
setLanguageTags(tags, 'ja');
} else if (/[\u3130-\u318F]/.test(name)) { // Hangul
if (!item.locationSet) item.locationSet = { include: ['kr'] };
setLanguageTags(tags, 'ko');
} else if (/[\uA960-\uA97F]/.test(name)) { // Hangul
if (!item.locationSet) item.locationSet = { include: ['kr'] };
setLanguageTags(tags, 'ko');
} else if (/[\uAC00-\uD7AF]/.test(name)) { // Hangul
if (!item.locationSet) item.locationSet = { include: ['kr'] };
setLanguageTags(tags, 'ko');
} else {
if (!item.locationSet) item.locationSet = { include: ['001'] }; // the whole world
}

It doesn't look like we currently do any validation that language codes are sane, which ought to be fairly easy, or country codes, which might be harder given some of the other stuff supported via location-conflation.

@UKChris-osm UKChris-osm added the enhancement Actionable - add an enhancement to the source code label Oct 23, 2022
@LaoshuBaby
Copy link
Collaborator

But this is more problematic: ja->jp, - They could mean Jamaica too, which has reserved the code ja.

And this is just blocked: vi->vn - Is actually Virgin Islands

Oh, I haven't noticed this, so I think we can only do this check when those langcode also appear in tags:

e.g:

{
"locationSet":{"include":["zh"]},
"tags":{
"brand":"xxx",
"brand:zh":"xxx"
}
}

Because we found a *:zh and zh is the langcode, so we can confirm that the zh in locationSet is wrong use of langcode. because nobody will add a Japanese name in Jamaica?

@bhousel
Copy link
Member

bhousel commented Oct 24, 2022

I think the build script does validate all the locationSets, so all the codes appearing in there should be a valid country code (or something recognized as one).
The list of codes that country-coder supports is here: https://ideditor.codes/

As far as I remember, we don't really validate the language codes, and I'm not sure I want to get too deep into doing that, since this topic seems to be full of special cases and exceptions: https://wiki.openstreetmap.org/wiki/Multilingual_names

@LaoshuBaby said:

Because we found a *:zh and zh is the langcode, so we can confirm that the zh in locationSet is wrong use of langcode. because nobody will add a Japanese name in Jamaica?

I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

@LaoshuBaby
Copy link
Collaborator

I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

Emmm……So I guess, our coding philosophy is "let it crash" to expose this error, not try to fix it, even we can infer what is correct……?

OK, I won't request for this checker&auto-fix in build script again

@bhousel
Copy link
Member

bhousel commented Oct 27, 2022

Thanks @peternewman this seems ok 👍

Emmm……So I guess, our coding philosophy is "let it crash" to expose this error, not try to fix it, even we can infer what is correct……?

Yes - I think having the build fail and having a human take a look at why is a good thing. (This is different from what @peternewman wants, where there are several valid codes and he wants to just standardize on one of them.)

@bhousel bhousel merged commit 443ce3b into main Oct 27, 2022
@bhousel bhousel deleted the peternewman-fix-locationset branch October 27, 2022 14:29
@LaoshuBaby LaoshuBaby mentioned this pull request Oct 29, 2022
@peternewman
Copy link
Collaborator Author

I'll admit I hadn't tested this very well, aside from that it compiled, but I'm assuming your run in e9efc61 should have included it @bhousel so it looks like maybe it doesn't actually work currently? 😢 Can anyone spot anything obvious I've missed?

I think the build script does validate all the locationSets
I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

Confirmed it does (although incidentally the "chalk" doesn't work in GitHub actions it seems, now fixed in #7283 ):
https://github.com/osmlab/name-suggestion-index/actions/runs/3351865802/jobs/5553601255
https://github.com/osmlab/name-suggestion-index/actions/runs/3351865802/jobs/5553601300

We also don't currently have GitHub annotations to flag these errors up to people, I could have a go at adding them if you'd like?

As far as I remember, we don't really validate the language codes, and I'm not sure I want to get too deep into doing that, since this topic seems to be full of special cases and exceptions: https://wiki.openstreetmap.org/wiki/Multilingual_names

I appreciate it can get complicated, but presumably we could e.g. validate that any two character language code was an allocated one, even if we didn't check all the sub-codes?

@peternewman peternewman mentioned this pull request May 9, 2023
peternewman referenced this pull request May 9, 2023
- Add Westfield
- Update countries for TK Maxx (not a typo)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Actionable - add an enhancement to the source code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants