Update MARC importer language mapping table #9344
Looks like this got merged without any of the "stakeholders" review, but here are my review comments.
'jap': 'jpn',
'fra': 'fre',
'gwr': 'ger',
`gwr` looks like it could be a typo fix, but `cro` and `sze` seem unlikely. There's a pretty big archive of MARC records which have been imported, so it should be possible to see how frequently (if at all) these are used.
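A rough sketch of how that frequency check could look: tally the language code held at positions 35-37 of each record's 008 field and report the counts for the keys in question. The function names and the input shape (an iterable of raw 008 field strings) are assumptions for illustration, not part of Open Library's importer.

```python
from collections import Counter
from typing import Dict, Iterable

# Mapping-table keys whose origin is questioned in this review.
SUSPECT_CODES = {"gwr", "cro", "sze"}


def count_language_codes(fields_008: Iterable[str]) -> Counter:
    """Tally the 3-character language code at 008/35-37 of each record."""
    counts: Counter = Counter()
    for field in fields_008:
        if len(field) < 38:
            continue  # too short to contain positions 35-37
        counts[field[35:38].lower()] += 1
    return counts


def suspect_usage(fields_008: Iterable[str]) -> Dict[str, int]:
    """Return how often each suspect code appears (0 if never seen)."""
    counts = count_language_codes(fields_008)
    return {code: counts.get(code, 0) for code in sorted(SUSPECT_CODES)}
```

Run over the archived records, a zero count for `cro` or `sze` would suggest the entry is not earning its place in the table.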
`sze` looks like it could be an ISO code for the Seze language (https://en.wikipedia.org/wiki/Seze_language), which has nothing to do with `slo`, and I think `chu` -> `cro` has a similar ambiguity. Without comments, I'm not sure what that mapping is protecting against, but to me it looks like they are more likely to re-assign non-MARC language codes to unrelated languages. I imagine there were some historical records that those changes worked for, but this code can only protect against systematic and likely codes we might encounter through regular older catalog imports.
I guess my natural bias is towards not changing things I don't understand, particularly since the code represents (or should) decades of accumulated knowledge, but I'd be hard pressed to argue for preserving such an ancient bit of cruft.
'it ': 'ita',
# LOC MARC Deprecated code updates
'cam': 'khm',  # Khmer
'esp': 'epo',  # Esperanto
Looks like `esk` (Eskimo) is missing, which is one of the codes that drew complaints.
I only added the codes which had a clear, correct one-to-one mapping -- your list was super helpful in showing the corrected mappings. Many deprecated codes don't have a single obvious mapping, which prevents this kind of automated fix, and there seems to be a range of reasons why a code is deprecated. Some seem to be technical dialect-versus-language distinctions, like `lan` -> `oci` -- that makes me think some cataloged items might lose information if the item really is in the Languedocien dialect and was cataloged correctly, but would now be listed under a family (Occitan), with a time period, which may or may not relate to the item. It could go the other way though. I don't know enough of the details, but that struck me as potentially quite a difference.
I think your advice on what is needed to correct the ~217 `esk` codes in https://github.com/internetarchive/openlibrary/issues/8733#issuecomment-1901168076 is still good, and I'm not sure it can be automated (without the risk of mis-assigning codes based on naive assumptions).
It might be worth tossing in a comment about `esk` (and any other unmappable codes) so that people reviewing the list don't think that they were forgotten.
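One possible shape for that note, purely as a sketch: keep the unmapped codes in their own commented structure beside the table so the omission is visibly deliberate. `UNMAPPED_DEPRECATED_CODES` and `is_known_unmappable` are hypothetical names, not identifiers from this PR.

```python
# Deprecated LOC MARC codes deliberately left OUT of the mapping table
# because they have no single one-to-one replacement. They are listed
# here so reviewers can see they were considered, not forgotten.
UNMAPPED_DEPRECATED_CODES = {
    "esk": "Eskimo; no single replacement, see issue #8733",
}


def is_known_unmappable(code: str) -> bool:
    """True if the code is a documented deprecated code with no automatic fix."""
    return code in UNMAPPED_DEPRECATED_CODES
```

A plain comment in the table would do the same job; the dict form just makes the list checkable from code.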
Closes #8140
Updates the MARC import table mappings to replace LOC-deprecated three-character language codes with their current codes.
This will ensure records imported from older MARCs have the up-to-date codes in Open Library.
Technical
This will not modify language codes on existing records (#8139); it only affects new imports.
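As a rough illustration of the import-time behaviour, the sketch below looks a code up in a small excerpt of the table. `LANGUAGE_CODE_FIXES` and `normalize_language_code` are hypothetical names, and passing unknown codes through unchanged is an assumption; only the key/value pairs are taken from the diff in this PR.

```python
# Excerpt of the mapping table; pairs copied from the diff in this PR.
LANGUAGE_CODE_FIXES = {
    "jap": "jpn",
    "fra": "fre",
    "gwr": "ger",
    "it ": "ita",
    # LOC MARC deprecated code updates
    "cam": "khm",  # Khmer
    "esp": "epo",  # Esperanto
}


def normalize_language_code(code: str) -> str:
    """Return the current code for a known bad or deprecated code.

    Codes not in the table are returned unchanged (an assumption of
    this sketch, not confirmed importer behaviour).
    """
    return LANGUAGE_CODE_FIXES.get(code, code)
```

Because the lookup runs only while a record is being imported, records already in the catalog keep whatever code they were stored with.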
Testing
Screenshot
Stakeholders
@cdrini
@tfmorris