Command for creating a SQLite database of the OCR results #2

Closed

simonw opened this issue Jun 29, 2022 · 4 comments
Labels: enhancement (New feature or request)

Comments

simonw commented Jun 29, 2022

This command will suck down the OCR data from the bucket and use it to build a database.

simonw added the enhancement label on Jun 29, 2022

simonw commented Jun 29, 2022

This command could take quite a while - my test bucket generated 2GB of JSON in the textract-output/ folder!

As such, I think a progress bar is a good idea.

Since this will be working against the listed keys in that prefix, and the list objects call provides their sizes, the progress bar can be based on the total size of JSON that needs to be processed.
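A minimal sketch of how that could work, using boto3 and click (the bucket name and the prefix handling here are assumptions, not the final implementation):

```python
import boto3
import click

# Minimal sketch: list everything under textract-output/, sum the sizes,
# then drive the progress bar by bytes processed rather than key count.
s3 = boto3.client("s3")
bucket = "my-ocr-bucket"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
keys = []
total_bytes = 0
for page in paginator.paginate(Bucket=bucket, Prefix="textract-output/"):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])
        total_bytes += obj["Size"]

with click.progressbar(length=total_bytes, label="Fetching OCR results") as bar:
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... parse the Textract JSON and write it to SQLite here ...
        bar.update(len(body))
```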

simonw commented Jun 29, 2022

An option to download and save the full JSON to local disk would be useful, plus a way to run the command against that cached directory of data. This would support iteration during development without sucking down 2GB of data every time.
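One way the local cache could look - sketched here with a hypothetical cache directory that mirrors the S3 key layout, so the same parsing code can run against either source:

```python
import json
from pathlib import Path

# Sketch of a local cache: the directory mirrors the S3 key layout, so a
# later run can read the JSON from disk instead of hitting S3 again.
def fetch_with_cache(s3, bucket, key, cache_dir):
    "Return parsed JSON for key, downloading it only if not already cached."
    cached = Path(cache_dir) / key
    if cached.exists():
        return json.loads(cached.read_text())
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(body)
    return json.loads(body)
```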

simonw commented Jun 29, 2022

My local (unpublished) prototype of this feature is at http://localhost:8888/notebooks/Make%20database%20of%20s3-ocr%20data%20for%20sfms-history.ipynb

simonw added a commit that referenced this issue Jun 29, 2022

simonw commented Jun 29, 2022

This will work best if the local SQLite database can work as a cache, to minimize the amount of traffic to S3 and to allow the command to be quit and resumed, and run again to import fresh data.

It needs to fetch all of the .s3-ocr.json files in order to associate job IDs with filenames - and in the future to spot when the ETag of a file has changed and it needs to have a new job run against it.

500 PDFs = 500 .s3-ocr.json files = 500 GETs. So I'm going to cache their contents in a table and include the ETag of the .s3-ocr.json file too (trying to name it in a way that avoids confusion with the other ETag it stores) - then I can tell if one of those files has been updated and needs to be re-fetched by looking at the ETags from the list.
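A rough sketch of that ETag comparison using sqlite-utils - the table name, the column names and the assumption that each .s3-ocr.json file records a job_id are illustrative, not final:

```python
import json

import boto3
import sqlite_utils

s3 = boto3.client("s3")
db = sqlite_utils.Database("ocr.db")
bucket = "my-ocr-bucket"  # hypothetical bucket name

# List every .s3-ocr.json key along with its current ETag
paginator = s3.get_paginator("list_objects_v2")
listed_etags = {}
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".s3-ocr.json"):
            listed_etags[obj["Key"]] = obj["ETag"]

# Cache their contents, storing the JSON file's own ETag under a distinct
# column ("index_etag") so it is not confused with the PDF ETag that the
# JSON itself records.
jobs = db.table("ocr_jobs", pk="key")
cached = (
    {row["key"]: row["index_etag"] for row in jobs.rows} if jobs.exists() else {}
)

for key, etag in listed_etags.items():
    if cached.get(key) == etag:
        continue  # unchanged since the last run, skip the GET
    data = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    jobs.insert(
        {"key": key, "job_id": data["job_id"], "index_etag": etag},
        pk="key",
        replace=True,
    )
```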

Having fetched those .s3-ocr.json files it can associate the keys with the textract-output/JOB_ID/page result keys.

So now it can fetch the job results for each file that hasn't been fetched yet, parse them and use them to populate the searchable text table.
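Continuing the sketch above, the per-job fetch and the searchable text table could look something like this (again, the table names and the LINE-block handling are assumptions):

```python
# Continuing the sketch: pull every result page under textract-output/JOB_ID/
# for jobs that have not been imported yet, then populate a searchable table.
def fetch_job_text(s3, bucket, job_id):
    "Concatenate the LINE blocks from every result page of one Textract job."
    paginator = s3.get_paginator("list_objects_v2")
    lines = []
    prefix = "textract-output/{}/".format(job_id)
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            data = json.loads(
                s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            )
            for block in data.get("Blocks", []):
                if block["BlockType"] == "LINE":
                    lines.append(block["Text"])
    return "\n".join(lines)

pages = db.table("pages", pk="key")
for row in db["ocr_jobs"].rows:
    already = pages.exists() and db.execute(
        "select 1 from pages where key = ?", [row["key"]]
    ).fetchone()
    if already:
        continue  # lets the command be quit and resumed without re-fetching
    text = fetch_job_text(s3, bucket, row["job_id"])
    pages.insert({"key": row["key"], "text": text}, pk="key", replace=True)

# Make the text column full-text searchable
db["pages"].enable_fts(["text"], create_triggers=True, replace=True)
```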

simonw added a commit that referenced this issue Jun 29, 2022
simonw closed this as completed Jun 29, 2022
simonw added a commit that referenced this issue Jun 29, 2022