Update MARC importer language mapping table #9344

Merged
merged 2 commits into from
May 28, 2024
33 changes: 29 additions & 4 deletions openlibrary/catalog/marc/parse.py
@@ -279,14 +279,39 @@ def read_edition_name(rec: MarcBase) -> str:
'end': 'eng',
'enk': 'eng',
'ent': 'eng',
'cro': 'chu',
'jap': 'jpn',
'fra': 'fre',
'gwr': 'ger',
Contributor

gwr looks like it could be a typo fix, but cro and sze seem unlikely. There's a pretty big archive of MARC records which have been imported, so it should be possible to see how frequently (if at all) these are used.

Collaborator Author

sze looks like it could be an ISO code for the Seze language (https://en.wikipedia.org/wiki/Seze_language), which has nothing to do with slo, and I think cro -> chu has a similar ambiguity. Without comments, I'm not sure what those mappings are protecting against, but to me they look more likely to re-assign non-MARC language codes to unrelated languages. I imagine there were some historical records that those changes worked for, but this code can only protect against systematic, likely codes that we might encounter through regular imports of older catalogs.

Contributor

I guess my natural bias is towards not changing things I don't understand, particularly since the code represents (or should) decades of accumulated knowledge, but I'd be hard pressed to argue for preserving such an ancient bit of cruft.
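
The frequency check suggested at the top of this thread could be sketched roughly as follows. This is only an illustration, not code from the PR: it assumes the 008 control fields have already been extracted from the MARC archive as plain strings (the language code sits at positions 35-37 of a bibliographic 008 field), and the names `SUSPECT_CODES`, `count_language_codes`, and `suspect_code_frequencies` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical set of codes under discussion in this thread.
SUSPECT_CODES = {"cro", "sze", "gwr"}


def count_language_codes(fields_008: Iterable[str]) -> Counter:
    """Tally the 3-character language code found at 008/35-37."""
    counts: Counter = Counter()
    for field in fields_008:
        code = field[35:38]
        # Skip truncated fields that do not reach the language positions.
        if len(code) == 3:
            counts[code.lower()] += 1
    return counts


def suspect_code_frequencies(
    fields_008: Iterable[str], suspects=SUSPECT_CODES
) -> Dict[str, int]:
    """Report how often each suspect code appears (0 if never seen)."""
    counts = count_language_codes(fields_008)
    return {code: counts.get(code, 0) for code in sorted(suspects)}
```

A code that never appears in the imported records would be a reasonable candidate for removal. One that does appear would need its records inspected before deciding what it was meant to map to.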

'sze': 'slo',
'fr ': 'fre',
'fle': 'dut', # Flemish -> Dutch
# 2 character to 3 character codes
'fr ': 'fre',
'it ': 'ita',
# LOC MARC Deprecated code updates
'cam': 'khm', # Khmer
'esp': 'epo', # Esperanto
Contributor

Looks like esk (Eskimo) is missing, which is one of the codes that drew complaints.

Collaborator Author

@hornc hornc May 29, 2024

I only added the codes which had a clear one-to-one correct mapping -- your list was super helpful showing the corrected mappings. Many deprecated codes don't have a single obvious mapping, which prevents this kind of automated fix, and there seems to be a range of reasons why a code is deprecated. Some seem to be technical dialect-versus-language distinctions, like lan -> oci. That makes me think some cataloged items might lose information: if an item really is in the Languedocien dialect and was cataloged correctly, it would now be listed under a family (Occitan) with a time period, which may or may not relate to the item. It could go the other way though. I don't know enough of the details, but that struck me as potentially quite a difference.

I think your advice on what is needed to correct the ~217 esk codes in https://github.com/internetarchive/openlibrary/issues/8733#issuecomment-1901168076 is still good, and I'm not sure it can be automated (without the risk of mis-assigning codes based on naive assumptions).

Contributor

It might be worth tossing in a comment about esk (and any other unmappable codes) so that people reviewing the list don't think that they were forgotten.

'eth': 'gez', # Ethiopic
'far': 'fao', # Faroese
'fri': 'fry', # Frisian
'gae': 'gla', # Scottish Gaelic
'gag': 'glg', # Galician
'gal': 'orm', # Oromo
'gua': 'grn', # Guarani
'int': 'ina', # Interlingua (International Auxiliary Language Association)
'iri': 'gle', # Irish
'lan': 'oci', # Occitan (post 1500)
'lap': 'smi', # Sami
'mla': 'mlg', # Malagasy
'mol': 'rum', # Romanian
'sao': 'smo', # Samoan
'scc': 'srp', # Serbian
'scr': 'hrv', # Croatian
'sho': 'sna', # Shona
'snh': 'sin', # Sinhalese
'sso': 'sot', # Sotho
'swz': 'ssw', # Swazi
'tag': 'tgl', # Tagalog
'taj': 'tgk', # Tajik
'tar': 'tat', # Tatar
'tsw': 'tsn', # Tswana
}
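
As a rough illustration of the distinction drawn in the review between one-to-one fixes and unmappable codes such as esk, a lookup along the following lines would keep the two cases apart. This is a sketch under assumptions, not the importer's actual code: `LANGUAGE_CODE_FIXES` is a small excerpt of the table above, while `UNMAPPABLE_CODES` and `normalize_language_code` are hypothetical names.

```python
from typing import Tuple

# Excerpt of the one-to-one fixes from the table above.
LANGUAGE_CODE_FIXES = {
    "cam": "khm",  # Khmer
    "esp": "epo",  # Esperanto
    "lan": "oci",  # Occitan (post 1500)
    "fr ": "fre",  # 2 character to 3 character code
}

# Hypothetical: deprecated codes with no single correct replacement.
# Listing them explicitly shows reviewers they were not forgotten.
UNMAPPABLE_CODES = {"esk"}


def normalize_language_code(code: str) -> Tuple[str, bool]:
    """Return (code, needs_review).

    Unmappable codes are passed through unchanged and flagged for manual
    review. Other codes are rewritten only if a one-to-one fix is known.
    """
    if code in UNMAPPABLE_CODES:
        return code, True
    return LANGUAGE_CODE_FIXES.get(code, code), False
```

For example, `normalize_language_code("cam")` yields `("khm", False)`, while `normalize_language_code("esk")` leaves the code as it is and sets the review flag instead of guessing at a replacement.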

