Partner Imports: Detect low quality publishers #6611

Merged (14 commits) on Jun 10, 2022

Conversation

cclauss (Contributor) commented May 30, 2022

Closes #6573
Closes #6604

Create a set, LOW_QUALITY_PUBLISHERS (is there a better name? SPAM_PUBLISHERS?), and then build a set of publishers for each book. If "notebook" appears in the book's title and the two sets intersect, the book is low quality and we should not import it.
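
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `is_low_quality_book`, the dict-shaped book record with `title` and `publishers` keys, and the contents of the set are assumptions. The two publisher names are the only ones that the testing session in this description shows to be on the list.

```python
# Hypothetical contents: only these two names are confirmed by the PR's test session.
LOW_QUALITY_PUBLISHERS = {"tobias publishing", "pickleball publishing"}


def is_low_quality_book(book: dict) -> bool:
    """Sketch: flag a book when 'notebook' is in its title AND one of its
    publishers (compared case-insensitively) is on the block list."""
    # Assumed record shape: {"title": str, "publishers": list[str]}
    publishers = {p.casefold() for p in book.get("publishers") or []}
    has_notebook_title = "notebook" in book.get("title", "").casefold()
    # Set intersection is non-empty when any publisher is on the block list.
    return has_notebook_title and bool(publishers & LOW_QUALITY_PUBLISHERS)
```

Using `casefold()` on both sides keeps the comparison case-insensitive, and the set intersection avoids a nested loop over publishers.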

Technical

Testing

See: scripts/tests/test_partner_batch_imports.py

>>> a = {p.casefold() for p in ["razal", "tobias publishing", "koraya", "pickleball", "d"]}
>>> b = {p.casefold() for p in ["hol", "mad", "mazz", "mikemi", "tobias publishers"]}
>>> c = {p.casefold() for p in ["pickleball publishing"]}
>>> a & low_quality_publishers
{'tobias publishing'}
>>> b & low_quality_publishers
set()
>>> c & low_quality_publishers
{'pickleball publishing'}

Screenshot

Stakeholders

RayBB (Collaborator) left a comment

I really like this approach!

mekarpeles added the "Priority: 1" label (Do this week, receiving emails, time sensitive) [managed] on May 31, 2022
cclauss removed their assignment on May 31, 2022
cclauss force-pushed the low-quality-publishers branch from 0d2126c to ad86382 on June 2, 2022 17:20
cdrini (Collaborator) commented Jun 2, 2022

Can you remove the last commit? It looks like an autoformatter ran with it and I can't see the diff anymore

cclauss (Contributor, Author) commented Jun 2, 2022

Oh, fudge! I got sick of hand-wrapping long lines, so I ran black. Big mistake. I will do the deed in my morning.

cclauss (Contributor, Author) commented Jun 3, 2022

All quotes restored. Created #6624 to avoid quote grief on future pull requests until the follow-on to #6612 has landed.

cdrini force-pushed the low-quality-publishers branch from ef4f477 to 5df18dc on June 9, 2022 17:18
cdrini force-pushed the low-quality-publishers branch from 5df18dc to f03bfea on June 9, 2022 23:03
cclauss and others added 2 commits June 9, 2022 19:04
- Check publish_date field, not created
- Publish block list blocks import regardless of name match
- Add more title block words for independently published books
cdrini force-pushed the low-quality-publishers branch from f03bfea to c5a15c4 on June 9, 2022 23:05
cdrini (Collaborator) left a comment

Ok! I believe this should do the trick. Notable changes I made:

  1. These publishers are actually (confusingly) imported as authors, so check the authors field
  2. created isn't on the record; publish_date is the field we seek
  3. The publisher-author exclude list is a hard exclude; it doesn't require the title to contain anything. Those publisher-authors should always be excluded
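
The three changes listed above could look something like the sketch below. It is an illustration under stated assumptions, not the merged code: the name `is_excluded_import`, the block-list contents, the shape of the `authors` entries (dicts with a `name` key), the `publish_date` format (starting with a four-digit year), and the `MIN_PUBLISH_YEAR` cutoff are all hypothetical. Only the fields checked and the "independently published" title-word rule come from this thread.

```python
import re

# Hypothetical block lists; the real contents live in the PR's script.
EXCLUDED_AUTHORS = {"tobias publishing", "pickleball publishing"}
EXCLUDED_TITLE_WORDS = {"notebook"}
MIN_PUBLISH_YEAR = 2018  # assumed cutoff, not stated in this thread


def is_excluded_import(book: dict) -> bool:
    # Points 1 and 3: the publishers arrive in the authors field, and a match
    # there is a hard exclude, regardless of the title.
    authors = {a["name"].casefold() for a in book.get("authors") or []}
    if authors & EXCLUDED_AUTHORS:
        return True

    # Point 2: read publish_date (assumed to start with YYYY), not created.
    year_text = str(book.get("publish_date", ""))[:4]
    publish_year = int(year_text) if year_text.isdigit() else 0

    # Title block words only apply to independently published books.
    publishers = {p.casefold() for p in book.get("publishers") or []}
    title_words = set(re.split(r"\W+", book.get("title", "").casefold()))
    return (
        "independently published" in publishers
        and publish_year >= MIN_PUBLISH_YEAR
        and bool(title_words & EXCLUDED_TITLE_WORDS)
    )
```

Checking the author block list first means a blocked publisher-author short-circuits the function before any title or date parsing happens.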

Let me know if anything looks off! If all is good, please feel free to merge :)

cclauss merged commit de6ae10 into internetarchive:master on Jun 10, 2022
cclauss deleted the low-quality-publishers branch on June 10, 2022 05:18
Labels: "Affects: Partners"; "Priority: 1" (Do this week, receiving emails, time sensitive) [managed]