-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
imports: add importer for ISBNdb #8511
imports: add importer for ISBNdb #8511
Conversation
@@ -12,9 +12,9 @@ | |||
$ record = get_source_record(record_id) | |||
$if v.revision == 1: | |||
$ record_type = '' | |||
$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item'): | |||
$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item', 'ISBNdb'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still have bug #2643
Change `ImportItem.find_pending()` so that it returns a `map` if and only if the `map` is not empty, and otherwise return `None`. Without this, `manage_imports.import_all` doesn't sleep, which: 1. Causes text to scroll by faster than can possibly be ready, and 2. Causes load averages to spike, and consumes the entire CPU of one core (in my local dev environment, anyway). Currently, the conditional for when to sleep for 60 seconds after checking for a batch and finding nothing is never true, because it returned a `map`, which is always truthy: ``` >>> m = map(len, []) >>> bool(m) True >>> next(m) Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration ```
3a50e7e
to
0499ad9
Compare
If using locally, with a theoretical directory named 'isbndb_batches' that holds an ISBNdb dump named `isbndb.jsonl`, you'd run: ``` docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all ``` The first command would load the database with `staged` ISBNdb items, and the second command would start the import script to process them. Note: in a local environment you may run into permissions issues, so a quick (local only) fix would be `chmod -R 777 isbndb_batches`, to get around the Docker permissions issues. But note issues `chmod 777` brings in terms of global `rwx`.
…port source Note: this does not yet reenable AMZ as a source for Edition.from_isbn, which `/isbn` imports use.
68c5995
to
0b1c06b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to be adding additional low quality data sources when the users are still struggling to clean up the last enormous dump of poor quality data.
The garbage in the example data amply illustrates just how bad this is. GIGO
'msrp': '0.00', | ||
'title': '確定申告、住宅ローン控除とは?', | ||
'isbn13': '9780000002259', | ||
'authors': ['田中 卓也 ~autofilled~'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Autofilled?
'msrp': '1.99', | ||
'image': 'Https://images.isbndb.com/covers/01/01/9780000000101.jpg', | ||
'pages': 8, | ||
'title': 'Nga Aboriginal Art Cal 2000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An aboriginal art calendar isn't a book
'pages': 8, | ||
'title': 'Nga Aboriginal Art Cal 2000', | ||
'isbn13': '9780000000101', | ||
'authors': ['Nelson, Bob, Ph.D.'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is an inspirational speaker authoring an art calendar with the subject of "mushrooms"?
'edition': '1', | ||
'language': 'en', | ||
'subjects': ['PQ', '878'], | ||
'synopsis': 'Francesco Petrarca.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synopsis is the name of an Italian renaissance humanist? (and when the Japanese title references bridal gown trends)
This script was renamed to make it easier to import from, but it turns out for now this is not necessary. Formerly the `do_import` function was being imported into core.models.Edition.from_isbn(), but now that imports from `load()` directly, so the rename is not necessary.
This functionality will be moved into the affiliate server.
0b1c06b
to
bec372c
Compare
ol.autologin() | ||
if os.getenv('LOCAL_DEV'): | ||
ol = OpenLibrary(base_url="http://localhost:8080") | ||
ol.login("admin", "admin123") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdrini, @scottbarnes mentioned we may want to turn these creds for localhost which are also used in copydocs and make them environment variables in docker rather than coding them here. (for another PR)
60530b3
to
25f6f51
Compare
To validate data for parse_data(), we fill in some dummy data, and then remove it as soon as parse_data() runs. But this means if anyone wants to call load(), they need to call parse_data() to get the rec appropriate for load(), then 'manually' remove the dummy data. This commit moves the cleaning of the dummy data to catalog.addbook inside the normalize_import_record() function, which load() calls.
52a4a5a
to
c2b0e0c
Compare
An alternative approach to reducing the business logic in Edition.from_isbn, which also stops the race condition that can occur at /isbn when identical concurrent requests are made to /isbn for the same ISBN.
7a4ef39
to
f479cae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most recent changes lgtm
Partially closes #7658
Feature.
Technical
This adds
scripts/providers/isbndb.py
, which is adapted fromscripts/partner_batch_imports.py
. It should include:import.log
;partner_batch_imports.py
; andstatus
ofstaged
in theimport_item
db.Issues / room for improvement
This implementation has an impermissible amount of copy/pasted code from
partner_batch_imports.py
, owing to the desire to have this working quickly and properly. Both files need refactoring to DRY this up. This may also require changes to cron.Possible bug
A known bug, which I think is also present in
partner_batch_imports.py
, is thatimport.log
is not updated viaupdate_state()
when a particular item can't be imported, so upon resume it will try that item again, print an error tologger.info()
, and continue.Steps for importing from ISBNdb JSONL dump
If using locally, with a theoretical directory named 'isbndb_batches' in the
openlibrary
root:that holds an ISBNdb dump named
isbndb.jsonl
(anything with anisbndb
prefix should work), you'd run:The first command would load the database with
staged
ISBNdb items,and the second (optional) would import promise items and mark relevant ISBNs from the ISBNdb dump as
pending
, and the third command would start the import script to process them.Note: in a local environment you may run into permissions issues, so a quick (local only) fix would be
chmod -R 777 isbndb_batches
, to get around the Docker permissions issues. But note issueschmod 777
brings in terms of globalrwx
.Steps for using ISBNdb as a backing store for /isbn
See above, and do the following step, and its prerequesities:
Then visit
/isbn/{some_isbn_here_from_the_isbndb_dump}
, and it should import, and the history should note the source as ISBNdb.How the records look, using a promise item flow (ignore the unfortunate fact this is a CD-ROM)
After initial import, the item is staged:
Next, after
promise_batch_imports.py
the item is marked aspending
:Finally, once
docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all
is run, it'smodified
:Testing
There are minimal unit tests. I can include more output if it is useful.
The above database query shows the results, with the status being
staged
, and the data being unmarshalled into a format that is suitable for OL import.Stakeholders
@mekarpeles