Re-enable non-book filtering on importbot non-MARC imports #6284

Merged: 6 commits into internetarchive:master, Mar 29, 2022

Collaborator

@hornc hornc commented Mar 14, 2022

relates to #4151

Re-enables filtering of obvious non-books from some non-MARC record imports.

I'm not sure why this was disabled. The task that performs these imports should log the f"{self.primary_format} is NONBOOK" message. If there is an import identifier, it should probably be added to the logs so we can see whether this is working correctly.
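A minimal sketch of what adding an import identifier to that log line could look like. The helper names, the logger name, and the choice of identifier are assumptions for illustration, not the actual importbot code; only the "... is NONBOOK" message text comes from this PR.

```python
import logging
from typing import Optional

# Hypothetical logger name, not taken from the Open Library codebase.
logger = logging.getLogger("importbot.nonbook")


def nonbook_message(primary_format: str, identifier: Optional[str] = None) -> str:
    """Build the NONBOOK message, prefixed with an identifier when one is known."""
    message = f"{primary_format} is NONBOOK"
    if identifier:
        message = f"{identifier}: {message}"
    return message


def log_skipped(primary_format: str, identifier: Optional[str] = None) -> None:
    """Record a skipped non-book so the import task's logs show which record it was."""
    logger.info(nonbook_message(primary_format, identifier))
```

With an identifier such as an ISBN in the message, a skipped record can be traced back to its source line when checking whether the filter is working.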

The formats this filter excludes are:

A2 Apple II disk
AA Audio Cassette
AB Address Book
AJ Audio Recording Downloadable
AVI Video, AVI Format
AZ Audio, Other
BK Bookmark
BM Video, Betamax
C3 CD MP3
CD Compact Disc
CE Counterpack Empty
CF Counterpack Filled
CR CD-ROM
CRM CD-ROM, Macintosh
CRW CD-ROM, Windows
CX CD Extra, Audio
D3 3.5 Diskette
DA DVD, Audio
DD Day by Day Calendar
DF Dumpbin Filled
DI Diary, Journal, Blank Book
DL Desk Calendar
DO Doll
DR DVD-ROM
DRM DVD-ROM, Macintosh
DRW DVD-ROM, Windows
DS Diskette
DV DVD
EC Engagement Calendar
FC Cards,Flash Cards
FI Fiche
FM Filmstrip
FR Frieze
FZ Film, Other
GB Globe
GC Game Cartridge
GM Game
GR Greeting Card
H3 1.44M, 3.5 Disk, DOS
H5 1.2M, 5.25 Disk, DOS
L3 720K, 3.5 Disk, DOS
L5 360K, 5.25 Disk, DOS
LP LP Record
MAC Macintosh
MC Mini Calendar
MF Microfilm
MG Mug
MH 1.44M, Mac
ML 800K, Mac
MS MS-DOS
MSX Microsoft XBox
MZ Microfilm, Other
N64 Nintendo 64
NGA Nintendo Gameboy Advanced
NGB Nintendo Gameboy
NGC Nintendo Gameboy Color
NGE Nintendo Gamecube
NT NTSC
OR Online Resource
OS Other Operating System
PC Calendar
PP Postcard Book or Pack
PRP Promotional Poster
PS Poster
PSC Poster Calendar
PY Video, PlayStation Portables
QU Video, Quicktime Format
RE Real Audio Format
RV Real Video
SA Super Audio Format
SD Sega Dreamcast
SG Sega Genesis
SH Shelf Strip
SK Stickers
SL Slides
SMD Sony Mini Disc, Audio
SN Spinner
SO Soft Toy
SO1 Sony Playstation1
SO2 Sony Playstation2
SR Sheet Map, Rolled
SU Sega Saturn
TA Tape Reel (Audio, not computer tape)
TB Tube
TR Transparencies
TS T-Shirt
TY Toy; Plush; Doll
UX UNIX
V35 35MM
V8 Video, 8mm
VC VCD (VideoCD)
VD Video, VHS Format
VE SECAM
VF Film, 16mm
VK Video Disc
VM SVCD (SuperVideoCD)
VN HD DVD
VO Blu-Ray
VP PAL
VS Video, Super VHS
VU Video, 3/4 U-Matic
VY UMD Video (Sony Universal Media disc)
VZ Video, Other
WA WAV Format, Audio
WC Wallchart
WI Windows
WL Wall Calendar
WM Video, Windows Media Format
WP Window Piece
WT Wallet
WX Quantity Pack
XL Shrink-wrapped Pack
XZ Promotional, Other
ZF Zip Fastener
ZZ Merchandise, Other

This should be uncontroversial. If the filter prevents too many imports, the data should be re-checked for appropriateness.
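As a rough illustration, the check amounts to a set-membership test on the record's format code. The names below are hypothetical and the set holds only a handful of codes from the list above; it is a sketch, not the importbot implementation.

```python
# A small sample of the excluded codes listed above (not the full list).
NONBOOK_SAMPLE = frozenset({"BK", "DI", "DO", "MG", "PS", "TS", "ZZ"})


def is_nonbook(primary_format: str) -> bool:
    """Return True when the format code is one of the sampled non-book codes."""
    return primary_format.strip().upper() in NONBOOK_SAMPLE


print(is_nonbook("DI"))  # DI: Diary, Journal, Blank Book
print(is_nonbook("TP"))  # TP does not appear in the excluded list
```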

Technical

Testing

I have added a test to show how this prevents correctly described blank notebooks from being imported. Many of these have already been added to OL without this basic check:

https://openlibrary.org/search?q=title%3A+%22Moleskine+Cahier%22&mode=everything

Moleskine seems to be a reputable publisher that correctly distinguishes its notebooks in basic metadata.
Not all books by the author https://openlibrary.org/authors/OL3186674A/Moleskine are notebooks. https://openlibrary.org/works/OL21075128W/Grafton_Architects looks like a real design book with a WorldCat entry: https://www.worldcat.org/title/grafton-architects-inspiration-and-process-in-architecture/oclc/909366078 so some care should be taken with cleanup too.

This will also stop bookmarks:
https://openlibrary.org/works/OL21568824W/Indigo_Magnetic_Bookmarks
and dolls:
https://openlibrary.org/works/OL17411486W/Darth_Vader_In_A_Box_Together_We_Can_Rule_The_Galaxy

tote bags:
https://openlibrary.org/works/OL25718266W/Secret_Garden_BabyLit_Tote

Pens:
https://openlibrary.org/works/OL21137719W/Bright_Ideas_-_20_Double-Ended_Colored_Brush_Pens

Mugs:
https://openlibrary.org/works/OL20336373W/Keep_Calm_and_Hang_On_Mug

Tattoos? (I'm not sure exactly what this is. It isn't obvious after import that it's not a book, but the publisher metadata states it is "Merchandise, Other"):
https://openlibrary.org/works/OL21139184W/There%27s_No_Place_Like_Home

T-shirts: (test added)
AUS49852633|1423639103||9781423639107|US|I|TS||||I Like Big Books T-Shirt X-Large|||1 vol.|||||||20141201|Gibbs Smith, Publisher|DE|X||||||||||||||ENG||0.280|27.940|22.860|2.540||||||T|||||||20748||||||326333|AUD|39.99||||||||||||||||||||||||||||||||||||||||||||||||||||||||||BIP,OTH|49852633|35|9781423639107|49099247|||19801468|||||||Gibbs Smith, Publisher||1||||||||||||||||NON000000|||WZ|||
https://openlibrary.org/works/OL25716783W/I_Like_Big_Books_T-Shirt_X-Large

Most of this author/publisher's items seem to be other merchandise: https://openlibrary.org/authors/OL6831238A/Gibbs_Smith_Publisher

Puzzles:
https://openlibrary.org/search?q=title%3A+%221000+Piece%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22500+Piece%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22300+Piece%22&mode=everything
not exhaustive, just some obvious title matches

Origami paper:
https://openlibrary.org/search?q=title%3A+%22Origami+Paper%22&mode=everything

Some of these origami papers (the ones with images?) were originally imported from Amazon, but have been reimported multiple times by Importbot, e.g. https://openlibrary.org/books/OL7931200M/Origami_Paper_Dots

There are years' worth of random non-book products which have been imported here.

I'm not a fan of Import Bot's indiscriminate importing of bookseller data like this. Some basic checking up front, when the metadata is just sitting there, is best. Tidying up some of these after the fact is going to be hard. These are just some random examples I was able to find by looking at a small subset of the input. The original PR was meant to be a minimal attempt to catch the worst non-books. Plenty of notebooks will still sneak past this filter due to poor (possibly deliberately misleading) categorisation. I don't understand why it was disabled.

Bookseller data needs more quality checks than library MARC imports, but Open Library currently applies far fewer and appears to be deliberately taking a quantity-over-quality approach.

Screenshot

Stakeholders

@mekarpeles @seabelis @LeadSongDog @cdrini

@hornc
Collaborator Author

hornc commented Mar 14, 2022

Here's an example of a notebook which won't be caught by this filter because its format is TP (Trade Paper), which is a 'book' format:

USA78730010||||OD|I|TP||||Notebook Planner Doki Doki Literature Club Chibi Halloween Party DLC042 : Diary, Personal Budget, 114 Pages, Lesson, 6x9 Inch, Planning, High Performance, Paycheck Budget||||||||||20201231|Macdonald, Samiha|AU||||||||||||||114|ENG||0.227|22.860|15.240|0.724||||||T|||||||1776181||||||1781863|USD|6.99||||||||||||||||||||||||||||||||||||||||||||||||||||||||||BIP|78730010||9798588765800|72152837|||79448755|||||||Independently Published||||||||||||||||||BUS042000||||||
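To make the gap concrete, here is a hedged sketch that pulls the format code out of the two sample records in this thread and applies a format-only check. It assumes the format code is the seventh pipe-delimited field, which matches both samples (TS and TP) but is not confirmed against the feed's specification. The records are truncated after the title field, the notebook title is shortened, and all names are hypothetical.

```python
FORMAT_FIELD = 6  # assumed 0-based position of the format code

# Sample of the excluded codes from the PR description, not the full list.
NONBOOK_SAMPLE = frozenset({"DI", "TS", "ZZ"})

# Records truncated after the title field for brevity.
TSHIRT = "AUS49852633|1423639103||9781423639107|US|I|TS||||I Like Big Books T-Shirt X-Large"
NOTEBOOK = "USA78730010||||OD|I|TP||||Notebook Planner Doki Doki Literature Club"


def primary_format(record: str) -> str:
    """Return the format code from a pipe-delimited record."""
    return record.split("|")[FORMAT_FIELD]


def passes_format_filter(record: str) -> bool:
    """True when the record's format code is not in the sampled non-book set."""
    return primary_format(record) not in NONBOOK_SAMPLE


for record in (TSHIRT, NOTEBOOK):
    print(primary_format(record), passes_format_filter(record))
```

The T-shirt is rejected on its format code, while the notebook passes because TP is a book format, so a format-only check cannot catch it.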

@hornc hornc changed the title from "re-enable non-book filtering" to "re-enable non-book filtering on importbot" Mar 14, 2022
@hornc hornc changed the title from "re-enable non-book filtering on importbot" to "Re-enable non-book filtering on importbot" Mar 14, 2022
@hornc hornc changed the title from "Re-enable non-book filtering on importbot" to "Re-enable non-book filtering on importbot non-MARC imports" Mar 14, 2022
@mekarpeles mekarpeles self-assigned this Mar 14, 2022
@mekarpeles mekarpeles added the Priority: 2 Important, as time permits. [managed] label Mar 14, 2022
@hornc hornc added the Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] label Mar 18, 2022
@mekarpeles mekarpeles merged commit 8717e18 into internetarchive:master Mar 29, 2022
@LeadSongDog

@hornc @mekarpeles
Thank you, that’s a great start. Any suggestions on how cleanup can be addressed at scale?

Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed]

3 participants