Re-enable non-book filtering on importbot non-MARC imports #6284
relates to #4151
re-enables filtering obvious non-books from some non-MARC record imports.
I'm not sure why this was disabled. There should be some logging, from the f"{self.primary_format} is NONBOOK" message, that appears in the task that performs these imports. If there is an import identifier, it should probably be added to the logs so we can see whether this is working correctly.
The list of formats this excludes is:
Which should be uncontroversial. If this is preventing too many imports, the data should be re-checked for appropriateness.
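To illustrate the logging suggestion above, here is a minimal sketch of a non-book check that puts the import identifier in front of the NONBOOK message. It is not the importbot code: `NONBOOK_FORMATS`, `nonbook_message`, and the logger name are hypothetical, and the set holds only a placeholder code ("TS", taken from the T-shirt record quoted in this PR) rather than the real excluded list.

```python
import logging
from typing import Optional

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("importbot.nonbook_sketch")

# Placeholder only: "TS" is the code seen in the T-shirt record quoted in
# this PR. Treating it as excluded is an assumption, and the real list of
# excluded formats is not reproduced here.
NONBOOK_FORMATS = frozenset({"TS"})


def nonbook_message(primary_format: str, identifier: Optional[str] = None) -> Optional[str]:
    """Return a log message if the format is an excluded non-book, else None."""
    if primary_format not in NONBOOK_FORMATS:
        return None
    message = f"{primary_format} is NONBOOK"
    if identifier:
        # Prefix the identifier so a skipped record can be traced in the logs.
        message = f"{identifier}: {message}"
    logger.info(message)
    return message
```

A caller could skip the record whenever this returns a message, which keeps the skip decision and the log line in one place.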
Technical
Testing
Have added a test to show how this prevents correctly described blank notebooks from being imported -- many of which have already been added to OL without this basic checking:
https://openlibrary.org/search?q=title%3A+%22Moleskine+Cahier%22&mode=everything
Moleskine seems to be a reputable publisher that correctly distinguishes their notebooks in basic metadata.
Not all of the books by author https://openlibrary.org/authors/OL3186674A/Moleskine are notebooks. https://openlibrary.org/works/OL21075128W/Grafton_Architects looks like a real design book with a WorldCat entry: https://www.worldcat.org/title/grafton-architects-inspiration-and-process-in-architecture/oclc/909366078 so some care should be taken with clean-up too.
This will also stop bookmarks:
https://openlibrary.org/works/OL21568824W/Indigo_Magnetic_Bookmarks
and dolls:
https://openlibrary.org/works/OL17411486W/Darth_Vader_In_A_Box_Together_We_Can_Rule_The_Galaxy
tote bags:
https://openlibrary.org/works/OL25718266W/Secret_Garden_BabyLit_Tote
Pens:
https://openlibrary.org/works/OL21137719W/Bright_Ideas_-_20_Double-Ended_Colored_Brush_Pens
Mugs:
https://openlibrary.org/works/OL20336373W/Keep_Calm_and_Hang_On_Mug
Tattoos? (I'm not sure exactly what this is. It isn't obvious after import that it's not a book, but the publisher metadata states it is "Merchandise, Other"):
https://openlibrary.org/works/OL21139184W/There%27s_No_Place_Like_Home
T-shirts: (test added)
AUS49852633|1423639103||9781423639107|US|I|TS||||I Like Big Books T-Shirt X-Large|||1 vol.|||||||20141201|Gibbs Smith, Publisher|DE|X||||||||||||||ENG||0.280|27.940|22.860|2.540||||||T|||||||20748||||||326333|AUD|39.99||||||||||||||||||||||||||||||||||||||||||||||||||||||||||BIP,OTH|49852633|35|9781423639107|49099247|||19801468|||||||Gibbs Smith, Publisher||1||||||||||||||||NON000000|||WZ|||
https://openlibrary.org/works/OL25716783W/I_Like_Big_Books_T-Shirt_X-Large
Most of this author/publisher's items seem to be other merch: https://openlibrary.org/authors/OL6831238A/Gibbs_Smith_Publisher
Puzzles:
https://openlibrary.org/search?q=title%3A+%221000+Piece%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22500+Piece%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22300+Piece%22&mode=everything
not exhaustive, just some obvious title matches
Origami paper:
https://openlibrary.org/search?q=title%3A+%22Origami+Paper%22&mode=everything
Some of these origami papers (the ones with images?) were originally imported from Amazon, but reimported multiple times by Importbot e.g. https://openlibrary.org/books/OL7931200M/Origami_Paper_Dots
There are years of random non-book products which have been imported here.
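As a rough illustration of how much is visible up front, the sketch below splits the leading fields of the T-shirt record quoted above. The field positions (ISBN-13 at index 3, format code at index 6, title at index 10) are read off that one sample line and are assumptions about the feed layout, not a documented schema. `summarize_record` is a hypothetical helper.

```python
# Leading fields of the T-shirt record quoted in this PR (rest omitted).
SAMPLE_PREFIX = (
    "AUS49852633|1423639103||9781423639107|US|I|TS||||"
    "I Like Big Books T-Shirt X-Large"
)

# Assumed positions, inferred from the sample line only.
ISBN13_INDEX = 3
FORMAT_INDEX = 6
TITLE_INDEX = 10


def summarize_record(line: str) -> dict:
    """Pull out the fields a pre-import non-book check would need."""
    fields = line.split("|")
    if len(fields) <= TITLE_INDEX:
        raise ValueError("record has too few fields")
    return {
        "isbn_13": fields[ISBN13_INDEX],
        "primary_format": fields[FORMAT_INDEX],
        "title": fields[TITLE_INDEX],
    }


summary = summarize_record(SAMPLE_PREFIX)
```

For this sample, `summary["primary_format"]` is "TS", so the format code is available before anything is written to Open Library.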
I'm not a fan of the indiscriminate importing of bookseller data like this by Import Bot. Some basic checking up front when the metadata is just sitting there is best. Tidying up some of these after the fact is going to be hard. These are just some random examples I was able to find from looking at a small subset of the input. This original PR was meant to be a minimal attempt to catch the worst non-books. There are plenty of notebooks which will sneak past this filter due to poor (possibly deliberately misleading) categorisation. I don't understand why it was disabled.
Bookseller data needs more quality checks than library MARC imports, but currently Open Library applies far fewer to it and appears to be deliberately taking a quantity-over-quality approach.
Screenshot
Stakeholders
@mekarpeles @seabelis @LeadSongDog @cdrini