Research why printdisabled Archive.org items not in Open Library #1047

Closed
mekarpeles opened this issue Aug 6, 2018 · 4 comments
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Data Cleanup Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] Type: Subtask of Epic A subtask that is part of the work breakdown of an epic issue (see comments). [managed]

Comments

@mekarpeles
Member

mekarpeles commented Aug 6, 2018

There are ~755,879 (as admin) or 78,704 (incognito) archive.org items in the printdisabled collection that have neither an openlibrary nor an openlibrary_edition ID:

https://archive.org/search.php?query=collection%3Aprintdisabled%20AND%20-openlibrary_edition%3A%2A%20AND%20-openlibrary%3A%2A

We need to update or create the Open Library editions that are missing Archive.org IDs.

Updated query:

Since not all printdisabled items are necessarily published books with good quality metadata, limiting the scope to items with ISBNs will give us better quality imports into Open Library:

https://archive.org/search.php?query=collection%3Aprintdisabled%20AND%20NOT%20collection%3Ainlibrary%20AND%20NOT%20openlibrary_edition%3A%2A%20AND%20isbn%3A%2A

This gives ~330,110 items (as admin) that are printdisabled-only and should be imported into, or linked to, Open Library records.
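For anyone wanting to vary the clauses, here is a minimal sketch of how the search URL above can be assembled from its parts. The helper name `build_search_url` and the constant names are hypothetical, made up for this illustration; only the base URL and the query clauses come from the links in this issue.

```python
from urllib.parse import quote

# Base of the archive.org search links used in this issue.
SEARCH_BASE = "https://archive.org/search.php?query="

# Clauses of the updated query: printdisabled-only items that have an ISBN
# but no Open Library edition ID yet.
PRINTDISABLED_ONLY_WITH_ISBN = [
    "collection:printdisabled",
    "NOT collection:inlibrary",
    "NOT openlibrary_edition:*",
    "isbn:*",
]


def build_search_url(clauses):
    """Join the clauses with AND and percent-encode the whole query.

    Hypothetical helper: safe="" makes quote() encode ':' and '*' as well,
    which is how the links in this issue are written.
    """
    query = " AND ".join(clauses)
    return SEARCH_BASE + quote(query, safe="")


url = build_search_url(PRINTDISABLED_ONLY_WITH_ISBN)
```

Dropping or adding a clause in the list (for example removing `isbn:*`) produces the corresponding broader or narrower search link.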

See wiki page https://github.com/internetarchive/openlibrary/wiki/archive.org-%E2%86%94-Open-Library-synchronisation for information on IA ↔ OL synchronisation.

Solution

Loop over the archive.org items whose ocaids are missing from Open Library, and search Open Library for each item's ISBNs and/or title.

If a corresponding Open Library edition exists for that ISBN, write the ocaid to that edition. If the edition is an orphan (it has no work), make a dummy edit so that a work is created, then perform a writeback to Archive.org so that the openlibrary_edition and openlibrary_work fields are set.

We should do this using the Open Library Client (not the import bot).
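The decision logic in the steps above could be sketched roughly as follows. This is not the Open Library Client API: `plan_sync`, the `find_edition_by_isbn` callback, the record shapes (`identifier`, `isbn`, `ocaid`, `works`, `key`) and the step names are all assumptions for illustration. The "conflict" branch for an edition that already carries a different ocaid is also an addition of this sketch, not something the issue specifies. The sketch only plans the steps and performs no writes.

```python
def plan_sync(items, find_edition_by_isbn):
    """Plan, without writing anything, what to do for each archive.org item.

    items: iterable of dicts like {"identifier": "...", "isbn": ["..."]}
        (assumed shape).
    find_edition_by_isbn: hypothetical callback returning an edition dict,
        or None when Open Library has no edition for that ISBN.
    Returns a list of (ocaid, steps, edition_key) tuples.
    """
    plan = []
    for item in items:
        ocaid = item["identifier"]

        # Try each ISBN until one resolves to an Open Library edition.
        edition = None
        for isbn in item.get("isbn", []):
            edition = find_edition_by_isbn(isbn)
            if edition is not None:
                break

        if edition is None:
            plan.append((ocaid, ["no-match"], None))
            continue

        existing = edition.get("ocaid")
        if existing and existing != ocaid:
            # Assumption of this sketch: never overwrite a different ocaid.
            plan.append((ocaid, ["conflict"], edition["key"]))
            continue

        steps = []
        if existing != ocaid:
            steps.append("set-ocaid")
        if not edition.get("works"):
            # Orphan edition: a dummy edit is needed so a work gets created.
            steps.append("create-work")
        steps.append("writeback")
        plan.append((ocaid, steps, edition["key"]))
    return plan
```

Keeping the lookup behind a callback means the same planning code can be dry-run against a data dump before any real edits are made.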

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Data Cleanup sync-ia-ol Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] labels Aug 6, 2018
@LeadSongDog

LeadSongDog commented Aug 15, 2018

Hey, @mekarpeles I'm not certain if it's related, but searching
https://openlibrary.org/search?q=Anna+Donizetti&mode=everything
finds many readable ocaids that have single-edition works, all of which should have been merged under https://openlibrary.org/works/OL2215609W
which has none readable.

The same is true for
https://openlibrary.org/search?q=Don+Pasquale+Donizetti&mode=everything
or perhaps more precisely
https://openlibrary.org/search?q=%22Don+Pasquale%22+OL284141A&mode=everything

[Revised 16 August]
This last finds 1 work with 37 editions, 13 works with 1 edition, and 3 naked editions misfiled as /works/
https://openlibrary.org/works/OL19115192M
https://openlibrary.org/works/OL24188824M
https://openlibrary.org/works/OL8800695M
Of those, there are 8 readable editions. No readables are among the 37 under the main work, six are among the 13 single-edition works, and one ( https://openlibrary.org/works/OL24188824M ) is among the 3 misfiled naked editions.

I'm not sure if you consider the 13 to be "orphans", as they are under extant-but-unnecessary work records.
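As an aside, the "naked editions misfiled as /works/" case can be spotted mechanically, since the URLs above pair a /works/ path with an ID ending in M. A minimal sketch, assuming the usual suffix convention (W for works, M for editions under /books/, A for authors); the function name `is_misfiled` is hypothetical:

```python
import re

# Assumed convention: namespace in the path vs. the letter ending the ID.
SUFFIX_FOR_NAMESPACE = {"works": "W", "books": "M", "authors": "A"}
KEY_PATTERN = re.compile(r"^/(works|books|authors)/OL\d+([WMA])$")


def is_misfiled(key):
    """Return True when the ID suffix does not match the path namespace."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a recognised Open Library key: {key!r}")
    namespace, suffix = match.groups()
    return SUFFIX_FOR_NAMESPACE[namespace] != suffix
```

With this, `is_misfiled("/works/OL19115192M")` is true, while `/works/OL2215609W` passes the check.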

@hornc
Collaborator

hornc commented Apr 22, 2019

Similar to #732, but this is for non-inlibrary items.

@hornc
Collaborator

hornc commented Apr 22, 2019

IA client commands and OL datadump checking:

# Programmatic way to search for printdisabled items without OLIDs:
ia search "collection:printdisabled AND NOT openlibrary_edition:* AND NOT openlibrary:*" --itemlist > printdisabled-no-olid.lst
# Check the OL data dump for entries which contain those ocaids:
grep -Ff printdisabled-no-olid.lst /storage/openlibrary/ol_dump_editions_2019-03-31.txt > printdisabled-ocaid-matches.jsonl
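One caveat with `grep -Ff` is that it matches substrings anywhere in the line, so an identifier that is a prefix of another identifier, or that appears in some other field, also counts as a hit. A stricter check could compare the `ocaid` field exactly. The sketch below assumes the dump's usual tab-separated layout with the JSON record in the fifth column; the function name `editions_with_ocaids` is hypothetical.

```python
import json


def editions_with_ocaids(dump_lines, ocaids):
    """Yield edition records whose "ocaid" field is exactly one of ocaids.

    dump_lines: iterable of dump lines, assumed to be tab-separated as
        type, key, revision, last_modified, JSON.
    ocaids: iterable of archive.org identifiers (e.g. the --itemlist output).
    """
    wanted = set(ocaids)
    for line in dump_lines:
        columns = line.rstrip("\n").split("\t", 4)
        if len(columns) < 5:
            continue  # skip malformed or blank lines
        record = json.loads(columns[4])
        if record.get("ocaid") in wanted:
            yield record
```

Both inputs can be open file handles after stripping newlines from the identifier list, so the dump is streamed rather than loaded into memory.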

@hornc hornc changed the title Sync Archive.org IDs that aren't in Open Library Sync Archive.org printdisabled IDs that aren't in Open Library Apr 22, 2019
@brad2014 brad2014 added the Affects: Data Issues that affect book/author metadata or user/account data. [managed] label May 4, 2019
@hornc hornc removed the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Jun 5, 2019
@mekarpeles mekarpeles added the Type: Subtask of Epic A subtask that is part of the work breakdown of an epic issue (see comments). [managed] label Jul 1, 2019
@mekarpeles
Member Author

I'm converting this to a research task for #732 Sync OpenLibrary.org ↔ Archive.org Identifiers and closing the issue as it's being answered in https://github.com/internetarchive/openlibrary/wiki/archive.org-%E2%86%94-Open-Library-synchronisation.

@mekarpeles mekarpeles changed the title Sync Archive.org printdisabled IDs that aren't in Open Library Research why printdisabled Arch IDs that aren't in Open Library Jul 1, 2019
@mekarpeles mekarpeles changed the title Research why printdisabled Arch IDs that aren't in Open Library Research why printdisabled Archive.org items not in Open Library Jul 1, 2019