Mangled MARC
Current query: https://openlibrary.org/search?q=title_suggest%3A%C2%A9+AND+ia%3A*&mode=everything
It's been a while since I've looked at this, but the mangling is lossy: the text looks easy to repair in the most common cases, but the approach falls down when trying to fix everything properly. The correct fix will be to re-extract the strings from the MARC source, or to compare against the original archive.org record, since many (but not all) records are correct now.
Mangled ©♭ represents é (e-acute). There are other variations, but e-acute is the most common form of this problem.
Looking for counts of mangled e-acute records in OL dumps:
zgrep -c "\\\u00a9\\\u266" ol_dump_authors_2023-08-31.txt.gz
Authors: 1,542
zgrep -c "\\\u00a9\\\u266" ol_dump_works_2023-08-31.txt.gz
Works: 8,974
zgrep -c "\\\u00a9\\\u266" ol_dump_editions_2023-08-31.txt.gz
Editions: 11,486
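For anyone who prefers to script the counts, here is a minimal Python sketch of the same check. It assumes the dump lines contain the literal escaped text `\u00a9\u266…` (as the zgrep pattern implies) and mirrors the same truncated prefix. The function name `count_mangled_lines` is made up for illustration.

```python
import gzip

# Literal escaped text as it appears in the dump's JSON, same prefix as the zgrep pattern.
MANGLED_PREFIX = r"\u00a9\u266"


def count_mangled_lines(path: str) -> int:
    """Count lines in a gzipped dump that contain the escaped mangled sequence."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if MANGLED_PREFIX in line:
                count += 1
    return count
```

Like `zgrep -c`, this counts matching lines, not individual occurrences, so a record with several mangled characters is counted once.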
I think MARC mangling of this particular kind is reversible because it is non-lossy. There may be exceptions. I have a vague recollection that sometimes it is not fully reversible, but that could be another form of MARC mangling, and there are many.
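A minimal sketch of the simple-case repair, assuming only the e-acute pair described on this page. The pattern is consistent with the UTF-8 bytes for é (C3 A9) having been read as MARC-8, where, as far as I can tell, C3 is © and A9 is ♭. The names `MANGLED_TO_FIXED` and `repair_mangled` are illustrative, and other variants would need to be confirmed before being added to the table.

```python
# Known mangled sequences mapped to the intended character.
# Only the e-acute pair comes from this page; any further entries are assumptions.
MANGLED_TO_FIXED = {
    "\u00a9\u266d": "\u00e9",  # "©♭" -> "é"
}


def is_mangled(text: str) -> bool:
    """True if the text contains any known mangled sequence."""
    return any(bad in text for bad in MANGLED_TO_FIXED)


def repair_mangled(text: str) -> str:
    """Replace each known mangled sequence with the intended character."""
    for bad, good in MANGLED_TO_FIXED.items():
        text = text.replace(bad, good)
    return text
```

This only covers the reversible cases. Anything not in the table is left untouched, which is the safer failure mode if some forms of mangling really are lossy.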
The other issue first reported is of diacritic characters being dropped, leaving a space, e.g. Heinrich Schro der (https://openlibrary.org/authors/OL4459814A/Heinrich_Schro_der) for Heinrich Schröder. This example does not have an archive.org scan, and does not appear to have a duplicate work or edition. It's unclear whether the author is duplicated or not, since there isn't enough disambiguating info (i.e. dates or ids).
zgrep "[0-9]\s2008" ol_dump_authors_2023-08-31.txt.gz | egrep "[a-z] [a-z]" | grep -v "\\\u" | grep "[a-z] [a-z]"
Refinements still needed: exclude name particles like von / de, and only search in name fields (bios give false matches).
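A rough Python sketch of those refinements, under a few assumptions: each dump line is tab-separated with the JSON record in the last column, the author's name is in a `name` field, and the particle list is just the two examples above. The helper names are made up for illustration.

```python
import json
import re

# Particles that legitimately appear in lowercase after a space.
# Only "von" and "de" come from the note above; extend as needed.
PARTICLES = {"von", "de"}

# A lowercase ASCII word that follows "<lowercase letter><space>".
SUSPECT = re.compile(r"(?<=[a-z]) ([a-z]+)")


def suspicious_name(name: str) -> bool:
    """True if the name has a lowercase fragment after a space that is not a known particle."""
    return any(m.group(1) not in PARTICLES for m in SUSPECT.finditer(name))


def name_from_dump_line(line: str) -> str:
    """Extract the name field from one dump line (JSON assumed to be the last tab-separated column)."""
    record = json.loads(line.rstrip("\n").split("\t")[-1])
    return record.get("name", "")
```

Because only the `name` field is inspected, bios no longer produce false matches. The heuristic will still flag legitimate particles that are missing from `PARTICLES` (note that "der" in the example above is itself a common particle), so results need manual review.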
Highlighting which mangled titles are associated with actual item scans, so they can be prioritised, might be a good idea. Presumably mangled strings that reduce the discoverability of viewable items have more of a cost than mangled strings that only affect viewing of the metadata record.
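One way to sketch that prioritisation in Python, assuming edition records carry the archive.org identifier in an `ocaid` field and the title in `title`. Both field names are assumptions here, and `priority` is a hypothetical helper.

```python
MANGLED_E_ACUTE = "\u00a9\u266d"  # "©♭", the most common mangled form


def priority(record: dict) -> int:
    """Rank an edition record: 0 = mangled title with a scan, 1 = mangled without a scan, 2 = not mangled."""
    title = record.get("title", "")
    if MANGLED_E_ACUTE not in title:
        return 2
    # An "ocaid" value is taken to mean the edition has an archive.org item.
    return 0 if record.get("ocaid") else 1
```

Sorting candidate records by this value would put the mangled titles with viewable scans first.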
The impact of these mangled titles is that the mangled authors/works/editions will be invisible to any matching algorithm (unless they can be matched by strong ids such as ISBN or LCCN -- EXPLORE FURTHER)
The consequence is that future dupes will not be matched and the un-mangled version will just be added as if it were a totally new record. That suggests that by now many of these mangled items will have been reimported correctly.
Example: Louis-Fr©♭d©♭ric-Th©♭odore-Albert Rilliet (mangled) for Louis-Frédéric-Théodore-Albert Rilliet.
Yes, duplicate author exists:
https://openlibrary.org/authors/OL7420280A/Louis-Fr%C3%A9d%C3%A9ric-Th%C3%A9odore-Albert_Rilliet
It looks like we have work dupes, but it's not so clear with editions (volumes and multiple scanned items may complicate this).
https://openlibrary.org/works/OL24879904W
Actually 3 vols -- 1,2 appear on the correct author, v3 appears on the mangled.
POSSIBLE FIX: merge mangled author and work + correct v3 and move under correct work.
b21495506_0003 (archive.org item only exists once in the editions data -- the item link is not a dupe)
This example is complicated by multiple volumes in a series and how those should be handled in terms of works. It's possible the work should not be a dupe, but be a clearly marked V.3?
This is unlikely to happen through a MARC import since there is generally one MARC bibliographic record for a multivolume set.
For this example, the two works also look to be technically dupes, but they were imported separately because they are in different languages.
Need some more examples.