Update MARC importer language mapping table #9344

hornc · 2024-05-26T23:53:46Z

Closes #8140

Updates the MARC import mapping table to convert deprecated LOC three-character language codes to their current equivalents.

This ensures that records imported from older MARC records have up-to-date language codes in Open Library.

Technical

This will not modify language codes on existing records (#8139); it only affects new imports.
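As a rough illustration of how a mapping table like this is applied at import time, here is a minimal sketch. The names `LANGUAGE_CODE_FIXES` and `normalize_language_code` are hypothetical and are not taken from `parse.py`; only the code pairs shown in this PR's diff excerpts are used.

```python
# Hypothetical sketch: the table and function names are illustrative assumptions,
# not the actual identifiers in openlibrary/catalog/marc/parse.py.
LANGUAGE_CODE_FIXES = {
    'jap': 'jpn',
    'fra': 'fre',
    'it ': 'ita',
    # LOC MARC deprecated code updates
    'cam': 'khm',  # Khmer
    'esp': 'epo',  # Esperanto
}


def normalize_language_code(code: str) -> str:
    """Return the current code for a known bad or deprecated code, else the input unchanged."""
    return LANGUAGE_CODE_FIXES.get(code, code)


print(normalize_language_code('cam'))
print(normalize_language_code('eng'))
```

Codes that are not in the table pass through untouched, so only new imports carrying a listed code are affected.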

Testing

Screenshot

Stakeholders

@cdrini
@tfmorris

tfmorris

Looks like this got merged without any of the "stakeholders" review, but here are my review comments.

tfmorris · 2024-05-28T15:37:57Z

openlibrary/catalog/marc/parse.py

    'jap': 'jpn',
    'fra': 'fre',
-    'gwr': 'ger',


gwr looks like it could be a typo fix, but cro and sze seem unlikely. There's a pretty big archive of MARC records which have been imported, so it should be possible to see how frequently (if at all) these are used.

sze looks like it could be an ISO code for the Seze language (https://en.wikipedia.org/wiki/Seze_language), which has nothing to do with slo, and I think chu -> cro has a similar ambiguity. Without comments, I'm not sure what those mappings are protecting against, but to me they look more likely to reassign non-MARC language codes to unrelated languages. I imagine there were some historical records that those changes worked for, but this code can only protect against systematic, likely errors that we might encounter through regular imports of older catalogs.

I guess my natural bias is towards not changing things I don't understand, particularly since the code represents (or should represent) decades of accumulated knowledge, but I'd be hard pressed to argue for preserving such an ancient bit of cruft.
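The frequency check suggested in this thread could be sketched roughly as follows. This assumes the raw 008 control field values have already been extracted from the MARC archive as strings, and it only looks at the language code in 008 positions 35-37; codes recorded elsewhere (for example in 041) would need separate handling. The function and helper names are hypothetical.

```python
from collections import Counter

# In a bibliographic 008 field, the language code occupies positions 35-37.
LANG_START, LANG_END = 35, 38


def count_language_codes(field_008_values, codes_of_interest):
    """Count how often each code of interest appears in the 008 language slot."""
    counts = Counter()
    for value in field_008_values:
        code = value[LANG_START:LANG_END]
        if code in codes_of_interest:
            counts[code] += 1
    return counts


def make_008(lang):
    """Build a placeholder 40-character 008 value with only the language set (test data)."""
    return ' ' * 35 + lang + '  '


sample = [make_008('gwr'), make_008('eng'), make_008('gwr')]
counts = count_language_codes(sample, {'gwr', 'cro', 'sze'})
print(counts)
```

A code that never shows up in the archive simply has a count of zero, which would support removing its mapping.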

tfmorris · 2024-05-28T15:41:50Z

openlibrary/catalog/marc/parse.py

    'it ': 'ita',
+    # LOC MARC Deprecated code updates
+    'cam': 'khm',  # Khmer
+    'esp': 'epo',  # Esperanto


Looks like esk (Eskimo) is missing, which is one of the codes that drew complaints.

I only added the codes which had a clear one-to-one correct mapping -- your list was super helpful showing the corrected mappings. Many deprecated codes don't have a single obvious mapping, which prevents this kind of automated fix, and there seems to be a range of reasons why a code is deprecated. Some seem to involve technical dialect-vs-language distinctions, like lan -> oci. That makes me think some cataloged items might lose information: if an item really is in the Languedocien dialect and was cataloged correctly, it would now be listed under a family (Occitan), with a time period, which may or may not relate to the item. It could go the other way though. I don't know enough of the details, but that struck me as potentially quite a difference.

I think your advice on what is needed to correct the ~217 esk codes in https://github.com/internetarchive/openlibrary/issues/8733#issuecomment-1901168076 is still good, and I'm not sure it can be automated (without the risk of mis-assigning codes based on naive assumptions).

It might be worth tossing in a comment about esk (and any other unmappable codes) so that people reviewing the list don't think that they were forgotten.
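One possible way to make the omission explicit, sketched under the assumption that unmappable codes are kept in their own commented collection next to the mapping table. `CODE_FIXES`, `NEEDS_MANUAL_REVIEW`, and `resolve_language_code` are hypothetical names, not part of the existing importer.

```python
# Hypothetical sketch: none of these names exist in the importer.
CODE_FIXES = {
    'cam': 'khm',  # Khmer
    'esp': 'epo',  # Esperanto
}

# Deprecated codes deliberately left unmapped because there is no single
# correct replacement; they need case-by-case review rather than automation.
NEEDS_MANUAL_REVIEW = {
    'esk',  # Eskimo: no one-to-one replacement
}


def resolve_language_code(code):
    """Return (code_to_store, needs_review) for a language code seen on import."""
    if code in NEEDS_MANUAL_REVIEW:
        return code, True
    return CODE_FIXES.get(code, code), False


print(resolve_language_code('esk'))
print(resolve_language_code('cam'))
```

Keeping the unmappable codes in a named set documents that they were considered rather than forgotten, and gives a hook for flagging records for manual correction.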

hornc added 2 commits May 27, 2024 11:33

remove uncommented /unjustified replacements (one off typos?)

1fdb221

update MARC import language mapping

fa48f2b

closes internetarchive#8140

hornc added Theme: MARC records Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] labels May 26, 2024

mekarpeles merged commit 3b5ef43 into internetarchive:master May 28, 2024
4 checks passed

mekarpeles self-assigned this May 28, 2024

tfmorris reviewed May 28, 2024