Command for creating a SQLite database of the OCR results #2

Closed

simonw opened this issue Jun 29, 2022 · 4 comments
Labels: enhancement (New feature or request)

Comments

simonw commented Jun 29, 2022

This command will suck down the OCR data from the bucket and use it to build a database.

simonw added the enhancement label on Jun 29, 2022

simonw commented Jun 29, 2022

This command could take quite a while - my test bucket generated 2GB of JSON in the textract-output/ folder!

As such, I think a progress bar is a good idea.

Since this will be working against the listed keys in that prefix, and the list objects call provides their sizes, the progress bar can be based on the total size of JSON that needs to be processed.
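A minimal sketch of how that could work, using boto3 and click (the bucket name and the prefix handling here are assumptions, not the final implementation):

```python
import boto3
import click

# Minimal sketch: list everything under textract-output/, sum the sizes,
# then drive the progress bar by bytes processed rather than key count.
s3 = boto3.client("s3")
bucket = "my-ocr-bucket"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
keys = []
total_bytes = 0
for page in paginator.paginate(Bucket=bucket, Prefix="textract-output/"):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])
        total_bytes += obj["Size"]

with click.progressbar(length=total_bytes, label="Fetching OCR results") as bar:
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... parse the Textract JSON and write it to SQLite here ...
        bar.update(len(body))
```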

simonw commented Jun 29, 2022

An option to download and save the full JSON to local disk would be useful, plus a way to run the command against that cached directory of data. This would support iteration during development without sucking down 2GB of data every time.
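One way the local cache could look - sketched here with a hypothetical cache directory that mirrors the S3 key layout, so the same parsing code can run against either source:

```python
import json
from pathlib import Path

# Sketch of a local cache: the directory mirrors the S3 key layout, so a
# later run can read the JSON from disk instead of hitting S3 again.
def fetch_with_cache(s3, bucket, key, cache_dir):
    "Return parsed JSON for key, downloading it only if not already cached."
    cached = Path(cache_dir) / key
    if cached.exists():
        return json.loads(cached.read_text())
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(body)
    return json.loads(body)
```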

simonw commented Jun 29, 2022

My local (unpublished) prototype of this feature is at http://localhost:8888/notebooks/Make%20database%20of%20s3-ocr%20data%20for%20sfms-history.ipynb

simonw added a commit that referenced this issue Jun 29, 2022

simonw commented Jun 29, 2022

This will work best if the local SQLite database can work as a cache, to minimize the amount of traffic to S3 and to allow the command to be quit and resumed, and run again to import fresh data.

It needs to fetch all of the .s3-ocr.json files in order to associate job IDs with filenames - and in the future to spot when the ETag of a file has changed and it needs to have a new job run against it.

500 PDFs = 500 .s3-ocr.json files = 500 GETs. So I'm going to cache their contents in a table and include the ETag of the .s3-ocr.json file too (trying to name it in a way that avoids confusion with the other ETag it stores) - then I can tell if one of those files has been updated and needs to be re-fetched by looking at the ETags from the list.
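A rough sketch of that ETag comparison using sqlite-utils - the table name, the column names and the assumption that each .s3-ocr.json file records a job_id are illustrative, not final:

```python
import json

import boto3
import sqlite_utils

s3 = boto3.client("s3")
db = sqlite_utils.Database("ocr.db")
bucket = "my-ocr-bucket"  # hypothetical bucket name

# List every .s3-ocr.json key along with its current ETag
paginator = s3.get_paginator("list_objects_v2")
listed_etags = {}
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".s3-ocr.json"):
            listed_etags[obj["Key"]] = obj["ETag"]

# Cache their contents, storing the JSON file's own ETag under a distinct
# column ("index_etag") so it is not confused with the PDF ETag that the
# JSON itself records.
jobs = db.table("ocr_jobs", pk="key")
cached = (
    {row["key"]: row["index_etag"] for row in jobs.rows} if jobs.exists() else {}
)

for key, etag in listed_etags.items():
    if cached.get(key) == etag:
        continue  # unchanged since the last run, skip the GET
    data = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    jobs.insert(
        {"key": key, "job_id": data["job_id"], "index_etag": etag},
        pk="key",
        replace=True,
    )
```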

Having fetched those .s3-ocr.json files it can associate the keys with the textract-output/JOB_ID/page result keys.

So now it can fetch the job results for each file that hasn't been fetched yet, parse them and use them to populate the searchable text table.
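Continuing the sketch above, the per-job fetch and the searchable text table could look something like this (again, the table names and the LINE-block handling are assumptions):

```python
# Continuing the sketch: pull every result page under textract-output/JOB_ID/
# for jobs that have not been imported yet, then populate a searchable table.
def fetch_job_text(s3, bucket, job_id):
    "Concatenate the LINE blocks from every result page of one Textract job."
    paginator = s3.get_paginator("list_objects_v2")
    lines = []
    prefix = "textract-output/{}/".format(job_id)
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            data = json.loads(
                s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            )
            for block in data.get("Blocks", []):
                if block["BlockType"] == "LINE":
                    lines.append(block["Text"])
    return "\n".join(lines)

pages = db.table("pages", pk="key")
for row in db["ocr_jobs"].rows:
    already = pages.exists() and db.execute(
        "select 1 from pages where key = ?", [row["key"]]
    ).fetchone()
    if already:
        continue  # lets the command be quit and resumed without re-fetching
    text = fetch_job_text(s3, bucket, row["job_id"])
    pages.insert({"key": row["key"], "text": text}, pk="key", replace=True)

# Make the text column full-text searchable
db["pages"].enable_fts(["text"], create_triggers=True, replace=True)
```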

simonw added a commit that referenced this issue Jun 29, 2022
simonw closed this as completed Jun 29, 2022
simonw added a commit that referenced this issue Jun 29, 2022