Recent (non-MARC) imports are adding deprecated language codes (presumably via language name lookups, not just old codes in the import data) #9504

Open
hornc opened this issue Jun 30, 2024 · 13 comments · May be fixed by #9651
Assignees
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@hornc
Collaborator

hornc commented Jun 30, 2024

Problem

https://openlibrary.org/books/OL51818714M/Yederasiw_Mastawesha

is a recently imported item that picked up the deprecated Ethiopian language code (the metadata has since been updated). It looks like the lookup that converts a language name to a code uses a list of codes that includes deprecated duplicates, so the resulting code may be the deprecated one (probably arbitrarily, depending on which is listed first?).

How to fix:
The name -> code lookup list should contain only current language codes.

This relates to the 'duplicates in the language drop down list' issue that I thought I saw recently, but cannot find it now. The dropdown and import translation list should both only contain current language codes.

Perhaps the language code config should have a deprecated parameter, and these can be excluded as needed.
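
A minimal sketch of that idea, assuming each language entry carries a hypothetical `deprecated` flag. The `LANGUAGES` records and the `build_name_lookup` helper are illustrative only, not Open Library's actual data or API:

```python
# Hypothetical language records; the `deprecated` flag is the proposed addition.
LANGUAGES = [
    {"code": "eth", "name": "Ethiopic", "deprecated": True},
    {"code": "gez", "name": "Ethiopic", "deprecated": False},
    {"code": "eng", "name": "English", "deprecated": False},
]


def build_name_lookup(languages, include_deprecated=False):
    """Map lower-cased language names to codes, skipping deprecated entries by default."""
    lookup = {}
    for lang in languages:
        if lang["deprecated"] and not include_deprecated:
            continue
        # setdefault keeps the first code seen for a name, so without the
        # filter above the result depends on list order.
        lookup.setdefault(lang["name"].lower(), lang["code"])
    return lookup


name_to_code = build_name_lookup(LANGUAGES)
unfiltered = build_name_lookup(LANGUAGES, include_deprecated=True)
```

With the filter, "Ethiopic" resolves to `gez`; without it, the deprecated `eth` wins simply because it is listed first, which would match the behaviour described above.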

Relates to #9002 in that the example shows at least BWB-sourced imports are using language lookups.

The specific code to change is:
https://github.com/internetarchive/openlibrary/pull/9488/files

@hornc hornc added Type: Bug Something isn't working. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] labels Jun 30, 2024
@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Jul 1, 2024
@mekarpeles
Member

@hornc can you propose a priority for this based on your use cases? Is this happening at a large scale (e.g. how many records are being affected)? Is this blocking one of our systems/processes? This would help us prioritize accordingly.

@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Jul 1, 2024
@AbhinavKRN

@scottbarnes can you assign this issue to me?

@scottbarnes
Collaborator

I have assigned this to you, @AbhinavKRN. Please ask any questions if you get stuck anywhere.

@AbhinavKRN

Sure @scottbarnes on it.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jul 28, 2024
@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Jul 28, 2024
@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jul 31, 2024
@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Jul 31, 2024
@hornc
Collaborator Author

hornc commented Aug 12, 2024

So, I think this is a relatively low priority issue because I have a bot task that runs weekly to correct deprecated language codes to their current codes (if one exists).

To do this properly, we might want to think a bit about what is supposed to happen in the various cases.

What should happen in the following cases:

  1. an import record contains the deprecated /languages/eth code?
  2. an import record contains the deprecated /languages/esk code?

I was hoping someone would find and link the related "duplicate languages in dropdowns" issue, as that has similar requirements for extending the language code model, which I think is necessary to add this functionality.

Optional language fields we might need to add:

deprecated: /type/boolean
deprecated_note: /type/string (a human-readable description of why this is deprecated, pointing to the preferred alternative if there is one, e.g. "use a more specific code" (not automatable) or "use a different code")
current: /type/language (a current language to use instead, if this code is deprecated, and there is an automatic preferred version.)

Note: some deprecated codes may not have a clear single value for current
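
To make the two cases above concrete, here is a rough sketch of how the proposed fields might drive import behaviour. Everything in it is an assumption for illustration: the `Language` dataclass, the `resolve` helper, the sample records, and in particular the choice to keep an unmappable code and flag it for review. Treating `esk` as having no single replacement is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    key: str                               # e.g. "/languages/gez"
    deprecated: bool = False
    deprecated_note: Optional[str] = None
    current: Optional[str] = None          # replacement key, if exactly one exists


# Illustrative records only, not real Open Library data.
LANGUAGES = {
    "/languages/gez": Language("/languages/gez"),
    "/languages/eth": Language(
        "/languages/eth",
        deprecated=True,
        deprecated_note="Use gez.",
        current="/languages/gez",
    ),
    "/languages/esk": Language(
        "/languages/esk",
        deprecated=True,
        deprecated_note="Use a more specific code.",
    ),
}


def resolve(key):
    """Return (key_to_store, needs_review) for a language key from an import record."""
    lang = LANGUAGES[key]
    if not lang.deprecated:
        return key, False
    if lang.current is not None:
        return lang.current, False   # automatic 1-to-1 replacement
    return key, True                 # no single replacement: keep and flag
```

Under these assumptions, `/languages/eth` would be stored as `/languages/gez`, while `/languages/esk` would be kept and flagged, since nothing can pick the more specific code automatically.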

I'm not completely happy with the current terminology, but I can't think of a better term at the moment. Anyone have any ideas for better naming?

@hornc hornc added the Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] label Aug 12, 2024
@hornc
Collaborator Author

hornc commented Aug 12, 2024

I think #8145 was perhaps the issue I remember, which touches on duplicate names. Is there a clearer one?

@hornc
Collaborator Author

hornc commented Aug 12, 2024

@cdrini having #8160 merged would bring us up to date with some of the previous language code issues that have already been raised, discussed, and addressed, so we can build on them here. Is there something blocking the merge of #8160?

@scottbarnes scottbarnes added Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Aug 12, 2024
@scottbarnes
Collaborator

@hornc, I had hoped we could discuss this during the Monday ABC call, but somehow it was missed during triage. I added this to the agenda for the coming week.

@cdrini
Collaborator

cdrini commented Aug 19, 2024

Howdy! Stumbled on this thanks to @RayBB ; taking a look at #8160

@cdrini
Collaborator

cdrini commented Aug 19, 2024

@hornc merged! Although I will note I'm not too sure why #8160 would help with deprecated languages 🤔 But leaving that up to you!

@hornc
Collaborator Author

hornc commented Aug 25, 2024

I just found this code that already translates deprecated language codes:

# LOC MARC Deprecated code updates
# Only covers deprecated codes where there
# is a direct 1-to-1 mapping to a single new code.
'cam': 'khm', # Khmer
'esp': 'epo', # Esperanto
'eth': 'gez', # Ethiopic
'far': 'fao', # Faroese
'fri': 'fry', # Frisian
'gae': 'gla', # Scottish Gaelic
'gag': 'glg', # Galician
'gal': 'orm', # Oromo
'gua': 'grn', # Guarani
'int': 'ina', # Interlingua (International Auxiliary Language Association)
'iri': 'gle', # Irish
'lan': 'oci', # Occitan (post 1500)
'lap': 'smi', # Sami
'mla': 'mlg', # Malagasy
'mol': 'rum', # Romanian
'sao': 'smo', # Samoan
'scc': 'srp', # Serbian
'scr': 'hrv', # Croatian
'sho': 'sna', # Shona
'snh': 'sin', # Sinhalese
'sso': 'sot', # Sotho
'swz': 'ssw', # Swazi
'tag': 'tgl', # Tagalog
'taj': 'tgk', # Tajik
'tar': 'tat', # Tatar
'tsw': 'tsn', # Tswana
}

I had been thinking this (and the related removal of deprecated language codes from the edition edit dropdown) required an update to the /type/language model. It looks like this could be fixed in code using the existing method.
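
A rough sketch of what fixing this in code could look like: apply a mapping like the one quoted above to the language keys of any import record before saving. The `normalize_language_keys` helper and the `DEPRECATED_TO_CURRENT` name are hypothetical, and only a few entries of the table are repeated here:

```python
# A subset of the 1-to-1 mapping quoted above.
DEPRECATED_TO_CURRENT = {
    "cam": "khm",  # Khmer
    "eth": "gez",  # Ethiopic
    "scc": "srp",  # Serbian
    "scr": "hrv",  # Croatian
}

PREFIX = "/languages/"


def normalize_language_keys(keys):
    """Swap deprecated codes in /languages/<code> keys for current ones.

    Assumes every key starts with "/languages/". Order is preserved and
    duplicates created by the substitution are dropped.
    """
    seen = set()
    result = []
    for key in keys:
        code = key.removeprefix(PREFIX)  # requires Python 3.9+
        new_key = PREFIX + DEPRECATED_TO_CURRENT.get(code, code)
        if new_key not in seen:
            seen.add(new_key)
            result.append(new_key)
    return result
```

Deduplicating matters because a record that lists both the deprecated and the current code would otherwise end up with the same language twice.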

@hornc
Collaborator Author

hornc commented Aug 27, 2024

It looks like MARC imports use the hardcoded deprecated language code tables in openlibrary/openlibrary/catalog/marc/parse.py, but imports from other sources do not.

#9809 is an attempt to consolidate the deprecations into the language code type, so there should be an opportunity to consolidate the imports, and perhaps remove the special-case translations?

@hornc hornc changed the title Recent imports are adding deprecated language codes (presumably via language name lookups, not just old codes in the import data) Recent (non-MARC) imports are adding deprecated language codes (presumably via language name lookups, not just old codes in the import data) Aug 27, 2024
@scottbarnes scottbarnes removed Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Sep 23, 2024
@scottbarnes
Collaborator

@AbhinavKRN, are you still interested in working on this issue? If not I will open it back for others who may wish to work on it.

5 participants