importbot importing books that are of bad quality #6283
This filter seems like a good first step forward. Some notes about the implementation of the filtering function follow.
The is_low_quality_book function can be rewritten to be even more readable (atm) as return ("notebook" in book_title.casefold() and "independently published" in book_item.publisher.casefold()). Alternatively:
In this commit, a new function is added that checks a book's title and other attributes for known exclusions. The effect of this commit is that importbot will filter out low-quality books.
…ment. Now the substring check is case-insensitive, and book_item.title.split() is removed. importbot filters out the low-quality books.
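The practical difference between a token-based check and a case-insensitive substring check can be sketched as follows. The token-based variant is only an assumption about what a `split()`-based check might look like; the earlier revision is not shown in this conversation, and the sample title is made up for illustration.

```python
# Made-up sample title, for illustration only.
title = "Blank Notebook: 120 Pages"

# Hypothetical token-based check: exact, case-sensitive match on whitespace-split words.
# "Notebook:" keeps its capital letter and trailing colon, so it is not equal to "notebook".
token_match = "notebook" in title.split()

# Case-insensitive substring check, in the style of the revised function.
substring_match = "notebook" in title.casefold()

print(token_match, substring_match)
```

A substring check is more permissive, so it also matches plurals and punctuation-adjacent words, which seems to be the intent for a spam filter.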
Thank you @jennshan for taking a stab at this and @jimman2003 for your review. This isn't the easiest issue to get started with, and we really appreciate your contribution to fixing a big set of problems regarding imports!
e6c3644 to 221dd0b
Looks like a good first step! Nicely done @jennshan! And thank you for helping lead the code review @jimman2003!
@@ -173,6 +173,10 @@ def csv_to_ol_json_item(line):
    b = Biblio(data)
    return {'ia_id': b.source_id, 'data': b.json()}


def is_low_quality_book(book_item):
    """check if a book item is of low quality"""
    return ("notebook" in book_item.title.casefold() and "independently published" in book_item.publisher.casefold())
This appears to be English-only code. Is that still allowed?
The bulk of the spammy books were in English; we should add more checks here as we find more bad auto-imports (https://docs.google.com/spreadsheets/d/1mG8Tn-sx73fdcHNAvFCJ-2p_-ULT_5QH90y5LLMpbtY/edit#gid=0). Good choice of query, @jennshan!
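One way the check could grow as more bad auto-imports are found is to keep the known exclusions in a table rather than in a single boolean expression. This is only a sketch, not what the PR implements: `LOW_QUALITY_RULES`, `matches_rule`, and the `SimpleNamespace` book item are hypothetical names, and the only rule listed is the one from this PR.

```python
from types import SimpleNamespace

# Each rule pairs a title substring with a publisher substring (both lowercase).
# Only this first entry comes from the PR; further entries would be added
# as new spam patterns are confirmed.
LOW_QUALITY_RULES = [
    ("notebook", "independently published"),
]


def matches_rule(book_item, rule):
    """Return True if both substrings of a rule occur, ignoring case."""
    title_part, publisher_part = rule
    return (
        title_part in book_item.title.casefold()
        and publisher_part in book_item.publisher.casefold()
    )


def is_low_quality_book(book_item):
    """Return True if the book item matches any known exclusion rule."""
    return any(matches_rule(book_item, rule) for rule in LOW_QUALITY_RULES)


# Hypothetical book item; assumes only `title` and `publisher` string attributes.
book = SimpleNamespace(title="Dot Grid Notebook", publisher="Independently published")
print(is_low_quality_book(book))
```

With this shape, adding a check means adding a tuple to the list instead of editing the function body.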
Glad to see some restraint is being applied to ImportBot, but it’s a much bigger problem than just notebooks. A few bad actors are creating massive numbers of bogus editions of classic works (many taken from IA), listing them on BWB, and we are re-inhaling them willy-nilly with dozens or even hundreds of ISBNs each. Many have BWB listings with bogus extra authors such as "Mint Editions": these lead to extra work records too!
In this commit, a new function is added that checks a book's title and other attributes for known exclusions.
The effect of this commit is that importbot will filter out low-quality books.
Closes #4151
hotfix
Technical
This is not a complete solution yet; it is just an initial attempt.
Testing
The importbot must run its periodic imports, filtering out books that have the keyword "Notebook" in their titles and whose publisher is "Independently published".
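A minimal sketch of how the filter could also be exercised without waiting for a periodic import run. The function body is taken from the diff in this PR; `FakeBookItem` is a hypothetical stand-in, assuming only that a book item exposes `title` and `publisher` string attributes, and the sample records are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeBookItem:
    """Hypothetical stand-in for the importbot's book item (an assumption)."""
    title: str
    publisher: str


def is_low_quality_book(book_item):
    """check if a book item is of low quality"""
    return (
        "notebook" in book_item.title.casefold()
        and "independently published" in book_item.publisher.casefold()
    )


# Both conditions hold, regardless of letter case: this one is filtered out.
spam = FakeBookItem("Lined NOTEBOOK for Cat Lovers", "Independently Published")
# The title matches, but the publisher does not: this one is kept.
novel = FakeBookItem("The Notebook", "Some Trade Publisher")
# The publisher matches, but the title does not: this one is kept.
memoir = FakeBookItem("A Quiet Life", "Independently published")

print(is_low_quality_book(spam), is_low_quality_book(novel), is_low_quality_book(memoir))
```

The second and third cases matter as much as the first: they check that the filter requires both conditions, so legitimate self-published books and trade titles containing "Notebook" still get imported.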
Screenshot
Stakeholders