Feature Request/Idea: Sanitize languages controlled vocabulary values #8243

Closed
jeromeroucou opened this issue Nov 16, 2021 · 11 comments · Fixed by #10481
Labels
Feature: Harvesting
Feature: Metadata
NIH OTA DC (Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons)
NIH OTA: 1.4.1 (4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwn; This is an item synched from the product ...)
pm.epic.nih_harvesting
pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues)
pm.GREI-d-1.4.2 (NIH, yr1, aim4, task2: Create working group on packaging standards)
pm.GREI-d-2.4.1B (NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues)
Size: 3 (A percentage of a sprint. 2.1 hours.)
Type: Feature (a feature request)
User Role: Depositor (Creates datasets, uploads data, etc.)

Comments

@jeromeroucou
Contributor

Overview of the Feature Request

In order to improve the content of the proposed languages as a list of controlled values, and to be able to expose them with an identifier later on, we want to modify them by adding the ISO 639-3 code as an alternative value.

Before making a pull request, we would like to have feedback from you on our proposal.

Please note that the language "Bihari" does not have an ISO 639-3 code, but only ISO 639-2 / 5.

A modified data migration script will be required.

Below are the proposed values:

    language    Abkhaz        0    abk
    language    Afar        1    aar
    language    Afrikaans        2    afr
    language    Akan        3    aka
    language    Albanian        4    sqi
    language    Amharic        5    amh
    language    Arabic        6    ara
    language    Aragonese        7    arg    
    language    Armenian        8    hye    
    language    Assamese        9    asm    
    language    Avaric        10    ava    
    language    Avestan        11    ave    
    language    Aymara        12    aym    
    language    Azerbaijani        13    aze    
    language    Bambara        14    bam    
    language    Bashkir        15    bak    
    language    Basque        16    eus    
    language    Belarusian        17    bel    
    language    Bengali, Bangla        18    ben    
    language    Bihari        19    bih 
    language    Bislama        20    bis    
    language    Bosnian        21    bos    
    language    Breton        22    bre    
    language    Bulgarian        23    bul    
    language    Burmese        24    mya    
    language    Catalan, Valencian        25    cat    
    language    Chamorro        26    cha    
    language    Chechen        27    che    
    language    Chichewa, Chewa, Nyanja        28    nya    
    language    Chinese        29    zho    
    language    Church Slavic, Slavonic        30    chu
    language    Chuvash        31    chv
    language    Cornish        32    cor    
    language    Corsican        33    cos    
    language    Cree        34    cre    
    language    Croatian        35    hrv    
    language    Czech        36    ces    
    language    Danish        37    dan    
    language    Divehi, Dhivehi, Maldivian        38    div    
    language    Dutch        39    nld    
    language    Dzongkha        40    dzo    
    language    English        41    eng    
    language    Esperanto        42    epo    
    language    Estonian        43    est    
    language    Ewe        44    ewe    
    language    Faroese        45    fao    
    language    Fijian        46    fij    
    language    Finnish        47    fin    
    language    French        48    fra    
    language    Fula, Fulah        49    ful        
    language    Galician        50    glg    
    language    Ganda        51    lug    
    language    Georgian        52    kat    
    language    German        53    deu    
    language    Greek (modern)        54    ell        
    language    Guarani        55    grn        
    language    Gujarati        56    guj    
    language    Haitian, Haitian Creole        57    hat    
    language    Hausa        58    hau    
    language    Hebrew (modern)        59    heb        
    language    Herero        60    her    
    language    Hindi        61    hin    
    language    Hiri Motu        62    hmo    
    language    Hungarian        63    hun    
    language    Icelandic        64    isl    
    language    Ido        65    ido    
    language    Igbo        66    ibo    
    language    Indonesian        67    ind    
    language    Interlingua        68    ina    
    language    Interlingue        69    ile    
    language    Inuktitut        70    iku    
    language    Inupiaq        71    ipk    
    language    Irish        72    gle    
    language    Italian        73    ita    
    language    Japanese        74    jpn    
    language    Javanese        75    jav    
    language    Kalaallisut, Greenlandic        76    kal        
    language    Kannada        77    kan    
    language    Kanuri        78    kau    
    language    Kashmiri        79    kas    
    language    Kazakh        80    kaz    
    language    Khmer        81    khm    
    language    Kikuyu, Gikuyu        82    kik    
    language    Kinyarwanda        83    kin    
    language    Kirghiz, Kyrgyz        84    kir        
    language    Komi        85    kom    
    language    Kongo        86    kon    
    language    Korean        87    kor    
    language    Kurdish        88    kur    
    language    Kwanyama, Kuanyama        89    kua    
    language    Lao        90    lao    
    language    Latin        91    lat    
    language    Latvian        92    lav    
    language    Limburgish, Limburgan, Limburger        93    lim    
    language    Lingala        94    lin    
    language    Lithuanian        95    lit    
    language    Luba-Katanga        96    lub    
    language    Luxembourgish, Letzeburgesch        97    ltz    
    language    Macedonian        98    mkd    
    language    Malagasy        99    mlg    
    language    Malay (Standard)        100    zsm        
    language    Malay (Central)        101    pse
    language    Malayalam        102    mal    
    language    Maltese        103    mlt    
    language    Manx        104    glv    
    language    Maori        105    mri        
    language    Marathi        106    mar    
    language    Marshallese        107    mah    
    language    Mixtepec Mixtec        108    mix    
    language    Mongolian        109    mon    
    language    Nauru        110    nau    
    language    Navajo, Navaho        111    nav    
    language    Ndonga        112    ndo    
    language    Nepali (macrolanguage)        113    nep    
    language    North Ndebele        114    nde        
    language    Northern Sami        115    sme    
    language    Norwegian        116    nor    
    language    Norwegian Bokmål        117    nob    
    language    Norwegian Nynorsk        118    nno    
    language    Nuosu, Sichuan Yi        119    iii        
    language    Occitan        120    oci    
    language    Ojibwe, Ojibwa        121    oji    
    language    Oriya        122    ori        
    language    Oromo        123    orm    
    language    Ossetian, Ossetic        124    oss    
    language    Pali        125    pli        
    language    Panjabi, Punjabi        126    pan    
    language    Pashto, Pushto        127    pus        
    language    Persian (Farsi)        128    fas        
    language    Polish        129    pol    
    language    Portuguese        130    por    
    language    Pular        131    fuf
    language    Pulaar        132    fuc
    language    Quechua        133    que    
    language    Romanian        134    ron    
    language    Romansh        135    roh    
    language    Rundi, Kirundi        136    run
    language    Russian        137    rus    
    language    Samoan        138    smo    
    language    Sango        139    sag    
    language    Sanskrit        140    san    
    language    Sardinian        141    srd    
    language    Scottish Gaelic, Gaelic        142    gla    
    language    Serbian        143    srp    
    language    Shona        144    sna    
    language    Sindhi        145    snd    
    language    Sinhala, Sinhalese        146    sin    
    language    Slovak        147    slk    
    language    Slovenian        148    slv        
    language    Somali        149    som    
    language    South Ndebele        150    nbl        
    language    Southern Sotho        151    sot    
    language    Spanish, Castilian        152    spa    
    language    Sundanese        153    sun    
    language    Swahili (macrolanguage)        154    swa        
    language    Swati        155    ssw    
    language    Swedish        156    swe    
    language    Tagalog        157    tgl    
    language    Tahitian        158    tah    
    language    Tajik        159    tgk    
    language    Tamil        160    tam    
    language    Tatar        161    tat    
    language    Telugu        162    tel    
    language    Thai        163    tha    
    language    Tibetan Standard, Tibetan, Central        164    bod    
    language    Tigrinya        165    tir    
    language    Tonga (Tonga Islands)        166    ton    
    language    Tsonga        167    tso    
    language    Tswana        168    tsn    
    language    Turkish        169    tur    
    language    Turkmen        170    tuk    
    language    Twi        171    twi    
    language    Ukrainian        172    ukr    
    language    Urdu        173    urd    
    language    Uyghur        174    uig    
    language    Uzbek        175    uzb    
    language    Venda        176    ven    
    language    Vietnamese        177    vie    
    language    Volapük        178    vol    
    language    Walloon        179    wln    
    language    Welsh        180    cym    
    language    Western Frisian        181    fry    
    language    Wolof        182    wol    
    language    Xhosa        183    xho    
    language    Yiddish        184    yid    
    language    Yoruba        185    yor    
    language    Zhuang, Chuang        186    zha    
    language    Zulu        187    zul    
    language    Not applicable        188        
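
To make the shape of these rows concrete, here is a minimal Python sketch that reads one row into a name, a display order, and a tuple of alternative values (the ISO 639-3 code, where present). The names `LanguageRow` and `parse_row` are hypothetical, and the sketch assumes columns are separated by a tab or by a run of two or more spaces, as they appear in this issue; it drops empty columns rather than modelling the full citation.tsv layout.

```python
import re
from typing import NamedTuple, Tuple


class LanguageRow(NamedTuple):
    """One controlled-vocabulary row (hypothetical helper type)."""
    name: str
    display_order: int
    alternates: Tuple[str, ...]


def parse_row(line: str) -> LanguageRow:
    """Parse a single 'language ...' row as shown in this issue."""
    # Assumption: columns are separated by tabs or by 2+ spaces;
    # single spaces are part of a value ("Bengali, Bangla").
    fields = [f for f in re.split(r"\t+|\s{2,}", line.strip()) if f]
    if len(fields) < 3 or fields[0] != "language":
        raise ValueError(f"not a language row: {line!r}")
    # Everything after the display order is treated as an alternative value.
    return LanguageRow(fields[1], int(fields[2]), tuple(fields[3:]))


if __name__ == "__main__":
    print(parse_row("    language    Albanian        4    sqi"))
    print(parse_row("    language    Not applicable        188"))
```

A row without a code, such as "Not applicable", simply yields an empty tuple of alternates.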

What kind of user is the feature intended for?
API User, Curator, Depositor, and Guest

What inspired the request?
Requirement of archive language metadata

What existing behavior do you want changed?
Improve languages list to be more compliant with ISO standard

Any brand new behavior do you want to add to Dataverse?
None

Any related open or closed issues to this feature request?
Pull request #7690

@stevenmce

While extending the list of languages is a good idea, this seems like it may be better handled through an external controlled vocabulary, using the new CV management capabilities.

The list of languages here seems to be a relatively small subset of the possible ISO 639-3 languages. A quick Wikipedia search (in the absence of access to the ISO standard) suggests that a broader list of ISO-specified languages exists (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Languages/List_of_ISO_639-3_language_codes_(2019)). As an example, for Australia, we would like to be able to use Aboriginal language groups.

@pdurbin
Member

pdurbin commented Oct 5, 2022

> Before making a pull request, we would like to have feedback from you on our proposal.

@jeromeroucou sure, this looks good. Please go ahead. Sorry for the slow reply.

> While extending the list of languages is a good idea, this seems like it may be better handled through an external controlled vocabulary, using the new CV management capabilities.

@stevenmce yes, I agree in principle. Maybe someday we'll move the language field to an external controlled vocabulary. But for now it probably makes sense to improve the existing internal field. Belt and suspenders, perhaps. 😄

@jeromeroucou if you're not familiar with this new-ish feature, please see https://guides.dataverse.org/en/5.12/admin/metadatacustomization.html#using-external-vocabulary-services

@mreekie

mreekie commented Dec 5, 2022

reference

people requesting extra ISO language codes to be added as legitimate controlled vocab. values (this is just a matter of adding extra values to citation.tsv); these are NOT duplicates, different things are being requested to be added in the issues below, but makes sense to get all 3 out of the way at the same time:

Added back the label: NIH OTA: 1.4.1

Need to touch base with Leonid on this.

mreekie added the NIH OTA: 1.4.1 label on Dec 5, 2022
mreekie added the pm.GREI-d-1.4.1 and pm.GREI-d-1.4.2 labels on Mar 20, 2023
cmbz added the pm.GREI-d-2.4.1B label on Jun 2, 2023
pdurbin added the Type: Feature and User Role: Depositor labels on Oct 9, 2023

@cmbz

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

landreev added the Feature: Metadata and Size: 10 labels on Dec 19, 2023

@landreev
Contributor

I've added the label "Metadata" - this is really a metadata issue; but Harvesting is affected, so it is not unreasonable to treat it as a harvesting issue as well.

I've given it size 10, under the assumption that this is only about modifying the citation block, and producing a flyway script for any adjustment to the existing values that may be needed.

Please note that the change to the metadata block is going to be less trivial than replacing the existing controlled vocab. values there with the copy-and-pasted "proposed values" above. The 2 will need to be merged, carefully. The list above contains some ISO abbreviations absent from the current block; but the opposite is true for some languages as well - we already support some codes that are not on the list above.
For example:
in the "proposed values" list above:

    language    Divehi, Dhivehi, Maldivian        38    div

in the citation.tsv as currently distributed:

    language	Divehi, Dhivehi, Maldivian		37	div	dv
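
As an illustration of the careful merge described in this comment, here is a minimal Python sketch that unions the alternative codes of the current and proposed lists, keyed on the language name. The function name `merge_alternates` and the dict-of-lists representation are assumptions made for this example, not how Dataverse actually stores or updates the block, and the sketch leaves display-order renumbering aside.

```python
from typing import Dict, List


def merge_alternates(
    current: Dict[str, List[str]],
    proposed: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Union alternative codes per language name, keeping first-seen order."""
    merged: Dict[str, List[str]] = {}
    # Keep every existing language first, then append languages that are new.
    names = list(current) + [n for n in proposed if n not in current]
    for name in names:
        codes: List[str] = []
        for code in current.get(name, []) + proposed.get(name, []):
            if code not in codes:  # drop duplicates, never drop a code
                codes.append(code)
        merged[name] = codes
    return merged


# The Divehi example from this comment, plus one language that only the
# proposal contains.
current = {"Divehi, Dhivehi, Maldivian": ["div", "dv"]}
proposed = {"Divehi, Dhivehi, Maldivian": ["div"], "Pular": ["fuf"]}
merged = merge_alternates(current, proposed)
```

With these inputs the existing `dv` code survives alongside `div`, and "Pular" is added with `fuf`, so neither list's codes are lost.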

@stevenferey
Contributor

Hello,

The initial proposal (Nov 16, 2021) is no longer up to date. Here is a new one with the associated PR, adapted to Dataverse 6.1 and linked, more precisely, to this commit: 991c5f9

You can provide your comments directly in the PR.
Thanks

@jggautier
Contributor

Following up on @stevenmce's earlier comment, #7377 is also related

@DS-INRAE
Member

DS-INRAE commented Feb 2, 2024

Just noticed this went into the current sprint. Please feel free to take over the draft PR, or let us know of any additions if you want us to modify something ;)

@landreev
Contributor

I am leaning towards just merging what's in #10197, with just minor additions, possibly.
The separate issue #9992 (allowing harvested metadata with values outside of controlled vocabs to be imported) is already in the works.
Then #8578 (waiting to be scheduled) can be used to mop up whatever problems are still not addressed by the combination of the above.

landreev added a commit that referenced this issue Feb 12, 2024
@landreev
Contributor

landreev commented Feb 14, 2024

I got gang-pressed into mostly working on something else for the past couple of weeks, but I am still determined to move this along ASAP. I am planning to make a new PR, from my own branch, instead of the draft PR #10197, but I may ask more questions there.

landreev added a commit that referenced this issue Apr 10, 2024
landreev added a commit that referenced this issue Apr 10, 2024
… - I'm leaving the main name intact (so that the block update will still works), but adding both versions as extra alternative names, so that either is importable. #8243
cmbz added the Size: 3 label and removed the Size: 10 label on Apr 10, 2024
landreev added a commit that referenced this issue Apr 10, 2024
landreev added a commit that referenced this issue Apr 10, 2024
…the order in which they are listed in the current ISO 639-3 table) #8243
landreev added a commit that referenced this issue Apr 12, 2024
landreev added a commit that referenced this issue May 15, 2024
…dates easier. Used the first 3-letter code as the identifier for each of the 185 supported languages. #8243
@DS-INRAE
Member

👏

pdurbin added this to the 6.3 milestone on May 23, 2024
Projects
Status: Done
9 participants