Tidy locationSet during build #7226

peternewman · 2022-10-19T14:44:33Z

We should use gb consistently, firstly because its the correct/actual location, secondly to make it easier and better compatibility for external tools and finally because it makes searches on nsi.guide more consistent too.

bhousel · 2022-10-19T14:50:39Z

I agree it's probably ok to just have the build script replace 'uk' -> 'gb', but I kind of don't like the idea of adding a config file for this.

peternewman · 2022-10-19T14:57:57Z

I agree it's probably ok to just have the build script replace 'uk' -> 'gb', but I kind of don't like the idea of adding a config file for this.

Okay, I was just going off the others. There are also theoretically some other mappings that could happen, although I suspect this is by far the most common...

bhousel · 2022-10-19T15:03:25Z

Yeah I think if it's just this one thing, I'd rather have an if in the build script than introduce a bunch more files and config.

Also, from last year I got the sense that people might also prefer to use the more specific QIDs for the UK like on this issue:
rapideditor/country-coder#35 (comment)

peternewman · 2022-10-19T15:23:43Z

Also, from last year I got the sense that people might also prefer to use the more specific QIDs for the UK like on this issue: ideditor/country-coder#35 (comment)

IMHO that would be a step backwards, especially if it's only done for the UK. E.g. https://github.com/westnordost/osmfeatures currently consumes this data (via the iD presets too, but it doesn't handle the UK nations currently, so anything tagged specifically in England for example doesn't currently show up. It strikes me that heading towards more niche (rather than internationally standardised via ISO) designations isn't going to help matters. Whereas pretty much any language can find out which ISO country code you're in without too much trouble.

This reverts commit 5227d30.

This reverts commit 0a79842.

This reverts commit 0c57083.

This reverts commit aac4fa1.

This reverts commit 4daf7f4.

This reverts commit b9fc077.

LaoshuBaby · 2022-10-20T19:54:39Z

I'm ashamed that I've actually never been able to figure out the nuances that exist between gb and uk. At first I think this PR meaning to replace uk in current json file, but looks like it add a new rule to build script, and that's the change attract me most. If possiable I wish to add replace rule zh->cn, ja->jp, ko->kr and vi->vn because I've seen those kind of wrong use of language code(especially in Asian language) many times.

peternewman · 2022-10-22T14:20:09Z

I'm ashamed that I've actually never been able to figure out the nuances that exist between gb and uk.

No worries, I hadn't really looked into it much myself in the past despite living here! This diagram shows it quite nicely:
https://commons.wikimedia.org/wiki/File:British_Isles_Venn_Diagram-en.png

So pendantically, gb<uk, however given only one of them is defined in ISO 3166-1:
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#GB

And the other is just reserved:
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#UK

The confusion is our inconsistency between our ISO 3166-1 alpha-2 code and country code TLD, which I think is fairly unique.

At first I think this PR meaning to replace uk in current json file,

Yeah I figured that while changing them once would resolve the short term issue, inevitably more would sneak in, hence the scripting...

but looks like it add a new rule to build script, and that's the change attract me most. If possiable I wish to add replace rule zh->cn, ja->jp, ko->kr and vi->vn because I've seen those kind of wrong use of language code(especially in Asian language) many times.

I assume this is people using a language code rather than a country code?

I think your list is 50-50 currently, from what I can see, these are fine as there's no current reservation:
zh->cn, ko->kr

But this is more problematic:
ja->jp, - They could mean Jamaica too, which has reserved the code ja.

And this is just blocked:
vi->vn - Is actually Virgin Islands

So is @bhousel happy for my config file to be reinstated if it's now more complicated...

We could also offer reverse mapping for some misused country codes back to language codes:
cn->zh
jp->ja
vn->vi

Apart from this one, where both exist:
kr->ko

It might make sense to push some of this into config too, given it's the same pairings:

name-suggestion-index/scripts/build_index.js

Lines 482 to 515 in 03b3934

    
           // If the name can only be reasonably read in one country, 
        
           // assign `locationSet`, and localize tags like `name:xx` 
        
           // https://www.regular-expressions.info/unicode.html 
        
           if (/[\u0590-\u05FF]/.test(name)) {          // Hebrew 
        
             // note: old ISO 639-1 lang code for Hebrew was `iw`, now `he` 
        
             if (!item.locationSet)  item.locationSet = { include: ['iw'] }; 
        
             setLanguageTags(tags, 'he'); 
        
           } else if (/[\u0E00-\u0E7F]/.test(name)) {   // Thai 
        
             if (!item.locationSet)  item.locationSet = { include: ['th'] }; 
        
             setLanguageTags(tags, 'th'); 
        
           } else if (/[\u1000-\u109F]/.test(name)) {   // Myanmar 
        
             if (!item.locationSet)  item.locationSet = { include: ['mm'] }; 
        
             setLanguageTags(tags, 'my'); 
        
           } else if (/[\u1100-\u11FF]/.test(name)) {   // Hangul 
        
             if (!item.locationSet)  item.locationSet = { include: ['kr'] }; 
        
             setLanguageTags(tags, 'ko'); 
        
           } else if (/[\u1700-\u171F]/.test(name)) {   // Tagalog 
        
             if (!item.locationSet)  item.locationSet = { include: ['ph'] }; 
        
             setLanguageTags(tags, 'tl'); 
        
           } else if (/[\u3040-\u30FF]/.test(name)) {   // Hirgana or Katakana 
        
             if (!item.locationSet)  item.locationSet = { include: ['jp'] }; 
        
             setLanguageTags(tags, 'ja'); 
        
           } else if (/[\u3130-\u318F]/.test(name)) {   // Hangul 
        
             if (!item.locationSet)  item.locationSet = { include: ['kr'] }; 
        
             setLanguageTags(tags, 'ko'); 
        
           } else if (/[\uA960-\uA97F]/.test(name)) {   // Hangul 
        
             if (!item.locationSet)  item.locationSet = { include: ['kr'] }; 
        
             setLanguageTags(tags, 'ko'); 
        
           } else if (/[\uAC00-\uD7AF]/.test(name)) {   // Hangul 
        
             if (!item.locationSet)  item.locationSet = { include: ['kr'] }; 
        
             setLanguageTags(tags, 'ko'); 
        
           } else { 
        
             if (!item.locationSet)  item.locationSet = { include: ['001'] };   // the whole world 
        
           }

It doesn't look like we currently do any validation that language codes are sane, which ought to be fairly easy, or country codes, which might be harder given some of the other stuff supported via location-conflation.

LaoshuBaby · 2022-10-23T17:14:41Z

But this is more problematic: ja->jp, - They could mean Jamaica too, which has reserved the code ja.

And this is just blocked: vi->vn - Is actually Virgin Islands

Oh, I haven't noticed this, so I think we can only do this check when those langcode also appear in tags:

e.g:

{
"locationSet":{"include":["zh"]},
"tags":{
"brand":"xxx",
"brand:zh":"xxx"
}
}

Because we found a *:zh and zh is the langcode, so we can confirm that the zh in locationSet is wrong use of langcode. because nobody will add a Japanese name in Jamaica?

bhousel · 2022-10-24T13:33:48Z

I think the build script does validate all the locationSets, so all the codes appearing in there should be a valid country code (or something recognized as one).
The list of codes that country-coder supports is here: https://ideditor.codes/

As far as I remember, we don't really validate the language codes, and I'm not sure I want to get too deep into doing that, since this topic seems to be full of special cases and exceptions: https://wiki.openstreetmap.org/wiki/Multilingual_names

@LaoshuBaby said:

Because we found a *:zh and zh is the langcode, so we can confirm that the zh in locationSet is wrong use of langcode. because nobody will add a Japanese name in Jamaica?

I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

LaoshuBaby · 2022-10-24T14:05:03Z

I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

Emmm……So I guess, our coding philosophy is "let it crash" to expose this error, not try to fix it, even we can infer what is correct……?

OK, I won't request for this checker&auto-fix in build script again

bhousel · 2022-10-27T14:29:19Z

Thanks @peternewman this seems ok 👍

Emmm……So I guess, our coding philosophy is "let it crash" to expose this error, not try to fix it, even we can infer what is correct……?

Yes - I think having the build fail and having a human take a look at why is a good thing. (This is different from what @peternewman wants, where there are several valid codes and he wants to just standardize on one of them.)

peternewman · 2022-10-29T15:14:11Z

I'll admit I hadn't tested this very well, aside from that it compiled, but I'm assuming your run in e9efc61 should have included it @bhousel so it looks like maybe it doesn't actually work currently? 😢 Can anyone spot anything obvious I've missed?

I think the build script does validate all the locationSets
I think if someone tries to put a zh in the locationSet include like that, the build script will raise it as an error..

Confirmed it does (although incidentally the "chalk" doesn't work in GitHub actions it seems, now fixed in #7283 ):
https://github.com/osmlab/name-suggestion-index/actions/runs/3351865802/jobs/5553601255
https://github.com/osmlab/name-suggestion-index/actions/runs/3351865802/jobs/5553601300

We also don't currently have GitHub annotations to flag these errors up to people, I could have a go at adding them if you'd like?

As far as I remember, we don't really validate the language codes, and I'm not sure I want to get too deep into doing that, since this topic seems to be full of special cases and exceptions: https://wiki.openstreetmap.org/wiki/Multilingual_names

I appreciate it can get complicated, but presumably we could e.g. validate that any two character language code was an allocated one, even if we didn't check all the sub-codes?

- Add Westfield - Update countries for TK Maxx (not a typo)

peternewman added 2 commits October 19, 2022 15:44

Tidy locationSet during build

1960794

We should use gb consistently, firstly because its the correct/actual location, secondly to make it easier and better compatibility for external tools and finally because it makes searches on nsi.guide more consistent too.

Add config to map uk to gb in locationSets

b9fc077

peternewman added 3 commits October 19, 2022 15:51

Add countryReplacements JSON schema

4daf7f4

Ship the countryReplacements dist file too

aac4fa1

Add the vscode settings

0c57083

Fix the syntax and make sure it's all done as lower case

0203b9f

Finish file-based support for now

0a79842

peternewman and others added 8 commits October 19, 2022 16:24

Fix the outstanding JSON syntax issues

5227d30

Revert "Fix the outstanding JSON syntax issues"

6f06d20

This reverts commit 5227d30.

Revert "Finish file-based support for now"

4c8ed2e

This reverts commit 0a79842.

Revert "Add the vscode settings"

613ff4f

This reverts commit 0c57083.

Revert "Ship the countryReplacements dist file too"

68f6330

This reverts commit aac4fa1.

Revert "Add countryReplacements JSON schema"

cb3da8b

This reverts commit 4daf7f4.

Revert "Add config to map uk to gb in locationSets"

6454c5a

This reverts commit b9fc077.

Pull the countryReplacement data back into the build script

c46333d

UKChris-osm added the enhancement Actionable - add an enhancement to the source code label Oct 23, 2022

bhousel merged commit 443ce3b into main Oct 27, 2022

bhousel deleted the peternewman-fix-locationset branch October 27, 2022 14:29

LaoshuBaby mentioned this pull request Oct 29, 2022

Fix GitHub Chalk #7283

Merged

peternewman mentioned this pull request May 9, 2023

Fix locationset fix #8121

Merged

peternewman referenced this pull request May 9, 2023

Update department_store.json

82fda4e

- Add Westfield - Update countries for TK Maxx (not a typo)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tidy locationSet during build #7226

Tidy locationSet during build #7226

peternewman commented Oct 19, 2022

bhousel commented Oct 19, 2022

peternewman commented Oct 19, 2022

bhousel commented Oct 19, 2022 •

edited

Loading

peternewman commented Oct 19, 2022

LaoshuBaby commented Oct 20, 2022

peternewman commented Oct 22, 2022 •

edited

Loading

LaoshuBaby commented Oct 23, 2022

bhousel commented Oct 24, 2022

LaoshuBaby commented Oct 24, 2022

bhousel commented Oct 27, 2022

peternewman commented Oct 29, 2022

Tidy locationSet during build #7226

Tidy locationSet during build #7226

Conversation

peternewman commented Oct 19, 2022

bhousel commented Oct 19, 2022

peternewman commented Oct 19, 2022

bhousel commented Oct 19, 2022 • edited Loading

peternewman commented Oct 19, 2022

LaoshuBaby commented Oct 20, 2022

peternewman commented Oct 22, 2022 • edited Loading

LaoshuBaby commented Oct 23, 2022

bhousel commented Oct 24, 2022

LaoshuBaby commented Oct 24, 2022

bhousel commented Oct 27, 2022

peternewman commented Oct 29, 2022

bhousel commented Oct 19, 2022 •

edited

Loading

peternewman commented Oct 22, 2022 •

edited

Loading