imports: add importer for ISBNdb #8511

Merged
merged 13 commits into from
Nov 18, 2023

Conversation

scottbarnes
Collaborator

@scottbarnes scottbarnes commented Nov 9, 2023

Partially closes #7658

Feature.

Technical

This adds scripts/providers/isbndb.py, which is adapted from scripts/partner_batch_imports.py. It should include:

  • valid data mapping from an ISBNdb record to an OL record;
  • a log of where the script is in the list of ISBNdb records, written as import.log;
  • the ability to gracefully restart (shamelessly taken from partner_batch_imports.py); and
  • a status of staged in the import_item db.
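
To make the first bullet concrete, here is a minimal sketch of such a mapping. It is not the code in scripts/providers/isbndb.py: the function name `isbndb_to_ol`, the `LANGUAGE_CODES` table, and the ISBNdb field names `publisher` and `date_published` are assumptions for illustration. The other input fields and the output shape follow the sample records quoted elsewhere in this PR.

```python
from typing import Any

# Assumed two-letter to three-letter language mapping; illustrative only.
LANGUAGE_CODES = {"en": "eng", "fr": "fre", "es": "spa"}


def isbndb_to_ol(record: dict[str, Any]) -> dict[str, Any]:
    """Map one ISBNdb record to an Open Library style import record (sketch)."""
    isbn_13 = record["isbn13"]
    ol_record: dict[str, Any] = {
        "title": record["title"],
        "isbn_13": [isbn_13],
        # The "idb:" prefix matches the source_records shown in this PR.
        "source_records": [f"idb:{isbn_13}"],
        "authors": [{"name": name} for name in record.get("authors", [])],
    }
    if "pages" in record:
        ol_record["number_of_pages"] = record["pages"]
    language = LANGUAGE_CODES.get(record.get("language", ""))
    if language:
        ol_record["languages"] = [language]
    # "publisher" and "date_published" are assumed ISBNdb field names.
    if record.get("publisher"):
        ol_record["publishers"] = [record["publisher"]]
    if record.get("date_published"):
        ol_record["publish_date"] = record["date_published"]
    return ol_record
```

Optional fields are copied only when present, so a sparse ISBNdb record yields a sparse import record rather than one padded with empty values.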

Issues / room for improvement

This implementation has an impermissible amount of copy/pasted code from partner_batch_imports.py, owing to the desire to have this working quickly and properly. Both files need refactoring to DRY this up. This may also require changes to cron.

Possible bug

A known bug, which I think is also present in partner_batch_imports.py, is that import.log is not updated via update_state() when a particular item can't be imported, so upon resume it will try that item again, print an error to logger.info(), and continue.
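
One possible way to avoid the retry, sketched under assumptions: record progress in a `finally` block so the saved offset advances even when an item fails. The function `import_with_resume`, the `import_one` callback, and a state file holding a single integer offset are all hypothetical. The real import.log format and update_state() signature may differ.

```python
from pathlib import Path
from typing import Callable, Iterable


def import_with_resume(
    lines: Iterable[str],
    state_file: Path,
    import_one: Callable[[str], None],
) -> list[int]:
    """Process lines, saving the next offset after every line, even on failure."""
    text = state_file.read_text().strip() if state_file.exists() else ""
    start = int(text) if text else 0
    failed: list[int] = []
    for offset, line in enumerate(lines):
        if offset < start:
            continue  # Already handled in a previous run.
        try:
            import_one(line)
        except Exception:
            failed.append(offset)  # Remember the failure, but keep going.
        finally:
            # Hypothetical state format: the offset of the next line to process.
            state_file.write_text(str(offset + 1))
    return failed
```

With this shape, a resumed run starts after the last attempted item rather than the last successful one, at the cost of never retrying items that failed for transient reasons.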

Steps for importing from ISBNdb JSONL dump

If using locally, with a theoretical directory named 'isbndb_batches' in the openlibrary root that holds an ISBNdb dump named isbndb.jsonl (anything with an isbndb prefix should work), you'd run:

docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches
# Optional, for promise items from https://archive.org/details/bwb_daily_pallets_2023-11-11
docker compose exec -e PYTHONPATH="." web python ./scripts/promise_batch_imports.py config/openlibrary.yml bwb_daily_pallets_2023-11-11
docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all

The first command would load the database with staged ISBNdb items, the second (optional) would import promise items and mark relevant ISBNs from the ISBNdb dump as pending, and the third would start the import script to process them.

Note: in a local environment you may run into Docker permissions issues. A quick (local-only) fix is chmod -R 777 isbndb_batches, but keep in mind that chmod 777 grants rwx to every user.

Steps for using ISBNdb as a backing store for /isbn

See above, and do the following step and its prerequisites:

docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches

Then visit /isbn/{some_isbn_here_from_the_isbndb_dump}, and it should import, and the history should note the source as ISBNdb.

How the records look, using a promise item flow (ignore the unfortunate fact this is a CD-ROM)

After initial import, the item is staged:

 8428 |        1 | 2023-11-13 23:10:09.830838 |             | staged |       | idb:9781857580143 | {"authors": [{"name": "Moore, Stephen"}], "isbn_13": ["9781857580143"], "languages": ["eng"], "number_of_pages": 220, "publish_date": "1992", "publishers": ["Letts Educational"], "source_records": ["idb:9781857580143"], "title": "Revise Sociology (GCSE CD-ROM Revision Guides)"}

Next, after promise_batch_imports.py the item is marked as pending:

openlibrary=# SELECT * FROM import_item WHERE id = 8428;                                                                                      
  id  | batch_id |         added_time         | import_time | status  | error |       ia_id       |                                                                                                                                          data                                           
                                                                                               | ol_key | comments                                                                                                                                                                          
------+----------+----------------------------+-------------+---------+-------+-----------                                                                                                                                                                        
 8428 |        1 | 2023-11-13 23:10:09.830838 |             | pending |       | idb:9781857580143 | {"authors": [{"name": "Moore, Stephen"}], "isbn_13": ["9781857580143"], "languages": ["eng"], "number_of_pages": 220, "publish_date": "1992", "publishers": ["Letts Educational"], "source_records": ["idb:9781857580143"], "title": "Revise Sociology (GCSE CD-ROM Revision Guides)"} |

Finally, once docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all is run, it's modified:

openlibrary=# SELECT * FROM import_item WHERE id = 8428;               
  id  | batch_id |         added_time         |        import_time         |  status  | error |       ia_id       | data |     ol_key     | comments                                                                                                                                        
------+----------+----------------------------+----------------------------+----------+--                                                                                                      
 8428 |        1 | 2023-11-13 23:10:09.830838 | 2023-11-13 23:29:03.733418 | modified |       | idb:9781857580143 |      | /books/OL3888M |  

Testing

There are minimal unit tests. I can include more output if it is useful.

The above database query shows the results, with the status being staged, and the data being unmarshalled into a format that is suitable for OL import.

  • If we DRY up the parse_data and /api/import path in plugins.importapi.ImportAPI, then let's make sure that /api/imports still works
  • Also that /isbn works (does not show a generic stack trace / 12093810923.html error page) with a record missing obvious data (e.g. no isbn, title, authors, etc)

Stakeholders

@mekarpeles

@mekarpeles mekarpeles self-assigned this Nov 13, 2023
@mekarpeles mekarpeles added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Nov 13, 2023
docker/ol-importbot-start.sh Outdated
openlibrary/core/models.py Outdated
openlibrary/core/models.py Outdated
@@ -12,9 +12,9 @@
$ record = get_source_record(record_id)
$if v.revision == 1:
$ record_type = ''
-$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item'):
+$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item', 'ISBNdb'):
Member

We still have bug #2643

scripts/promise_batch_imports.py Outdated
vendor/infogami Outdated
Change `ImportItem.find_pending()` so that it returns a `map` if and only
if the `map` is not empty, and otherwise return `None`.

Without this, `manage_imports.import_all` doesn't sleep, which:
1. Causes text to scroll by faster than it can possibly be read, and
2. Causes load averages to spike, and consumes the entire CPU of one
   core (in my local dev environment, anyway).

Currently, the conditional for when to sleep for 60 seconds after
checking for a batch and finding nothing is never true, because
`find_pending()` returns a `map`, which is always truthy:
```
>>> m = map(len, [])
>>> bool(m)
True
>>> next(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```
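
A minimal sketch of the idea behind the fix, not the actual `ImportItem.find_pending()` code: peek at the first element and return `None` when the iterator is empty, so the caller's truthiness check behaves as intended. The helper name `non_empty_or_none` is hypothetical.

```python
from itertools import chain
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def non_empty_or_none(items: Iterable[T]) -> Optional[Iterator[T]]:
    """Return an iterator over items, or None if there are no items."""
    iterator = iter(items)
    try:
        first = next(iterator)
    except StopIteration:
        return None  # Empty: lets `if not result:` trigger the sleep.
    # Put the peeked element back in front of the remaining ones.
    return chain([first], iterator)
```

A caller can then write `if not non_empty_or_none(map(len, [])): ...` and the branch is taken, unlike with a bare `map`.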
@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 2 times, most recently from 3a50e7e to 0499ad9 Compare November 15, 2023 15:23
If using locally, with a theoretical directory named 'isbndb_batches'
that holds an ISBNdb dump named `isbndb.jsonl`, you'd run:
```
docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches
docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all
```

The first command would load the database with `staged` ISBNdb items,
and the second command would start the import script to process them.

Note: in a local environment you may run into permissions issues, so a
quick (local only) fix would be `chmod -R 777 isbndb_batches`, to get
around the Docker permissions issues. But note issues `chmod 777` brings
in terms of global `rwx`.
…port source

Note: this does not yet reenable AMZ as a source for Edition.from_isbn,
which `/isbn` imports use.
Contributor

@tfmorris tfmorris left a comment

I don't think we want to be adding additional low quality data sources when the users are still struggling to clean up the last enormous dump of poor quality data.

The garbage in the example data amply illustrates just how bad this is. GIGO

'msrp': '0.00',
'title': '確定申告、住宅ローン控除とは?',
'isbn13': '9780000002259',
'authors': ['田中 卓也 ~autofilled~'],
Contributor

Autofilled?

'msrp': '1.99',
'image': 'Https://images.isbndb.com/covers/01/01/9780000000101.jpg',
'pages': 8,
'title': 'Nga Aboriginal Art Cal 2000',
Contributor

An aboriginal art calendar isn't a book

'pages': 8,
'title': 'Nga Aboriginal Art Cal 2000',
'isbn13': '9780000000101',
'authors': ['Nelson, Bob, Ph.D.'],
Contributor

Why is an inspirational speaker authoring an art calendar with the subject of "mushrooms"?

'edition': '1',
'language': 'en',
'subjects': ['PQ', '878'],
'synopsis': 'Francesco Petrarca.',
Contributor

The synopsis is the name of an Italian renaissance humanist? (and when the Japanese title references bridal gown trends)

scripts/tests/test_isbndb.py
This script was renamed to make it easier to import from, but it turns
out that is not necessary for now. Formerly the `do_import` function was
being imported into core.models.Edition.from_isbn(), but that now
imports `load()` directly, so the rename is not needed.
This functionality will be moved into the affiliate server.
ol.autologin()
if os.getenv('LOCAL_DEV'):
    ol = OpenLibrary(base_url="http://localhost:8080")
    ol.login("admin", "admin123")
Member

@cdrini, @scottbarnes mentioned we may want to take these localhost creds, which are also used in copydocs, and make them environment variables in Docker rather than hardcoding them here. (for another PR)
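
A rough sketch of what that could look like, assuming hypothetical variable names `OL_LOCAL_USERNAME` and `OL_LOCAL_PASSWORD` (neither exists in the repository as far as this PR shows). The defaults mirror the values currently hardcoded in the diff above.

```python
import os


def local_dev_credentials() -> tuple[str, str]:
    """Read local-dev login credentials from the environment (sketch)."""
    # Hypothetical variable names; defaults match the hardcoded values.
    username = os.getenv("OL_LOCAL_USERNAME", "admin")
    password = os.getenv("OL_LOCAL_PASSWORD", "admin123")
    return username, password
```

The script could then call `ol.login(*local_dev_credentials())`, and the values could be set once in the Docker environment for every script that needs them.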

@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 4 times, most recently from 60530b3 to 25f6f51 Compare November 16, 2023 18:50
To validate data for parse_data(), we fill in some dummy data, and then
remove it as soon as parse_data() runs. But this means if anyone wants
to call load(), they need to call parse_data() to get the rec
appropriate for load(), then 'manually' remove the dummy data.

This commit moves the cleaning of the dummy data to catalog.addbook
inside the normalize_import_record() function, which load() calls.
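
As an illustration only, the cleanup step might look something like the sketch below. The sentinel value `PLACEHOLDER` and the function `remove_placeholders` are hypothetical. The actual dummy values and the body of normalize_import_record() are not shown in this PR.

```python
from typing import Any

PLACEHOLDER = "????"  # Hypothetical sentinel used to satisfy validation.


def remove_placeholders(rec: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of rec without placeholder values (sketch)."""
    cleaned: dict[str, Any] = {}
    for key, value in rec.items():
        if value == PLACEHOLDER:
            continue  # Drop scalar placeholders such as a dummy date.
        if isinstance(value, list):
            value = [
                item
                for item in value
                if item != PLACEHOLDER and item != {"name": PLACEHOLDER}
            ]
            if not value:
                continue  # Drop list fields that end up empty.
        cleaned[key] = value
    return cleaned
```

Doing this inside the function that load() already calls means callers no longer have to remember a manual cleanup step between parse_data() and load().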
@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 2 times, most recently from 52a4a5a to c2b0e0c Compare November 16, 2023 19:19
An alternative approach to reducing the business logic in
Edition.from_isbn, which also stops the race condition that can occur at
/isbn when identical concurrent requests are made to /isbn for the same
ISBN.
@scottbarnes scottbarnes added the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Nov 18, 2023
Member

@mekarpeles mekarpeles left a comment

Most recent changes lgtm

@mekarpeles mekarpeles merged commit 05eca7c into internetarchive:master Nov 18, 2023
2 checks passed
@scottbarnes scottbarnes deleted the bulk-import-isbndb branch November 18, 2023 20:58
@jimchamp jimchamp removed the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Apr 2, 2024
Labels
Priority: 1 Do this week, receiving emails, time sensitive, . [managed]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stage ISBNdb Imports & Enable JIT Importing
4 participants