importbot importing books that are of bad quality #6283

jennshan
Contributor

In this commit, a new function is added that checks a book's title and other attributes for known exclusions.
The effect of this commit is that the importbot will filter out low-quality books.

Closes #4151

hotfix

Technical

This is not a complete solution yet; it is just an initial attempt.

Testing

The importbot must run its periodic imports, filtering out books that have the keyword "Notebook" in their titles and whose publisher is listed as independently published.
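
A minimal sketch of how the filter could be checked in isolation, without waiting for a periodic import. The function body matches the merged diff; the two records are hypothetical stand-ins built with `SimpleNamespace`, on the assumption that the filter only reads the `title` and `publisher` attributes.

```python
from types import SimpleNamespace


def is_low_quality_book(book_item):
    """check if a book item is of low quality"""
    return (
        "notebook" in book_item.title.casefold()
        and "independently published" in book_item.publisher.casefold()
    )


# Hypothetical records, not taken from a real import batch.
spam = SimpleNamespace(
    title="Lined Notebook for Cat Lovers",
    publisher="Independently Published",
)
keeper = SimpleNamespace(title="The Notebook", publisher="Warner Books")

# Both conditions must hold for a record to be filtered out.
assert is_low_quality_book(spam)
assert not is_low_quality_book(keeper)
```

The second record shows why the publisher condition matters: a title containing "Notebook" alone is not enough to exclude a book.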

Screenshot

Stakeholders

Contributor

@jimman2003 jimman2003 left a comment

This filter seems like a good first step forward. Some notes about the implementation of the filtering function:

Contributor

@jimman2003 jimman2003 left a comment

The is_low_quality_book function can be rewritten to be even more readable (atm) as return ("notebook" in book_title.casefold() and "independently published" in book_item.publisher.casefold()). Alternatively:

scripts/partner_batch_imports.py (outdated, resolved)
scripts/partner_batch_imports.py (outdated, resolved)
…ment.

Now the substring check is case-insensitive, and book_item.title.split() has been removed.
importbot discriminates the low quality books
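
To illustrate the change described in these commits, here is a rough comparison of a token-based check against the case-insensitive substring check. The `word_match` function is an assumption about what a `split()`-based check might have looked like; the earlier revision is not shown on this page. `substring_match` mirrors the title condition in the merged code.

```python
def word_match(title):
    # Hypothetical earlier approach: exact, case-sensitive token match.
    return "Notebook" in title.split()


def substring_match(title):
    # Case-insensitive substring check, as in the merged function.
    return "notebook" in title.casefold()


# Hypothetical titles chosen to show where the two approaches differ.
titles = ["Dot Grid Notebook", "NOTEBOOK: 120 Pages", "My notebooks"]

for title in titles:
    print(f"{title!r}: word={word_match(title)} substring={substring_match(title)}")
```

The substring check is deliberately broad, which is why the merged function also requires the publisher condition before treating a record as low quality.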
@mekarpeles mekarpeles self-assigned this Mar 14, 2022
@mekarpeles mekarpeles requested review from mekarpeles and cdrini March 14, 2022 19:41
@mekarpeles
Member

Thank you @jennshan for taking a stab at this and @jimman2003 for your review.

This isn't the easiest issue to get started with and we really appreciate your contribution to fixing a big set of problems regarding imports!

@mekarpeles mekarpeles assigned cdrini and unassigned cdrini Mar 14, 2022
@jennshan jennshan force-pushed the 4151/hotfix/bwb-importbot-low-quality-records branch from e6c3644 to 221dd0b Compare March 14, 2022 19:46
@mekarpeles mekarpeles added the Priority: 2 Important, as time permits. [managed] label Mar 14, 2022
@mekarpeles
Member

Looks like a good first step! Nicely done @jennshan! And thank you for helping lead the code review @jimman2003!

@mekarpeles mekarpeles merged commit 5d4d871 into internetarchive:master Mar 29, 2022
@@ -173,6 +173,10 @@ def csv_to_ol_json_item(line):
    b = Biblio(data)
    return {'ia_id': b.source_id, 'data': b.json()}

def is_low_quality_book(book_item):
    """check if a book item is of low quality"""
    return ("notebook" in book_item.title.casefold() and "independently published" in book_item.publisher.casefold())
Contributor

This appears to be English-only code. Is that still allowed?

Collaborator

The bulk of the spammy books were in English; we should add more checks here as we find more bad auto-imports. ( https://docs.google.com/spreadsheets/d/1mG8Tn-sx73fdcHNAvFCJ-2p_-ULT_5QH90y5LLMpbtY/edit#gid=0 ). Good choice of query, @jennshan !
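
One possible way to make room for more checks is a small rule table, sketched below. This is not what the merged code does: `LOW_QUALITY_RULES` and the `rules` parameter are hypothetical, and only the first keyword pair comes from the merged function.

```python
from types import SimpleNamespace

# Hypothetical rule table: (title keyword, publisher keyword), all lowercase.
# Only this first pair comes from the merged code.
LOW_QUALITY_RULES = [
    ("notebook", "independently published"),
]


def is_low_quality_book(book_item, rules=LOW_QUALITY_RULES):
    """Return True if any (title, publisher) keyword pair matches the record."""
    title = book_item.title.casefold()
    publisher = book_item.publisher.casefold()
    return any(t in title and p in publisher for t, p in rules)


# Hypothetical record for illustration.
record = SimpleNamespace(title="Blank NOTEBOOK", publisher="Independently Published")
print(is_low_quality_book(record))
```

New pairs could then be appended as further bad auto-imports are identified, without changing the function itself.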

@LeadSongDog
LeadSongDog commented Apr 7, 2022

Glad to see some restraint is being applied to ImportBot, but it’s a much bigger problem than just notebooks. A few bad actors are creating massive numbers of bogus editions of classic works (many taken from IA), listing them on BWB, and we are re-inhaling them willy-nilly with dozens or even hundreds of ISBNs each. Many have BWB listings with bogus extra authors such as "Mint Editions": these lead to extra work records too!
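
As a rough illustration of the extra-author problem, the sketch below drops blocklisted names from an author list before a record is created. It is not part of this PR: `BOGUS_AUTHORS` and `drop_bogus_authors` are hypothetical, the author list is assumed to be a plain list of name strings, and "Mint Editions" is the only name taken from the comment above.

```python
# Hypothetical blocklist, stored casefolded for case-insensitive matching.
BOGUS_AUTHORS = {"mint editions"}


def drop_bogus_authors(authors, bogus=BOGUS_AUTHORS):
    """Return the author names that are not on the blocklist."""
    return [name for name in authors if name.strip().casefold() not in bogus]


print(drop_bogus_authors(["Jane Austen", "Mint Editions"]))
```

A blocklist like this would only address the bogus author names, not the duplicate editions spread across many ISBNs.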

Labels
Priority: 2 Important, as time permits. [managed]
Development

Successfully merging this pull request may close these issues.

Fix BWB Importbot Low Quality records
6 participants