Command for creating a SQLite database of the OCR results #2
This command could take quite a while - my test bucket generated 2GB of JSON in the output prefix. As such, I think a progress bar is a good idea. Since this will be working against the listed keys in that prefix, and the list objects call provides their sizes, the progress bar can be based on the total size of JSON that needs to be processed.
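A minimal sketch of that approach, assuming boto3 and tqdm; the bucket and prefix names here are placeholders rather than anything s3-ocr defines:

```python
import boto3
from tqdm import tqdm

s3 = boto3.client("s3")
BUCKET = "my-ocr-bucket"  # placeholder bucket name
PREFIX = "ocr-results/"   # placeholder prefix holding the JSON

# ListObjectsV2 reports each object's Size, so a first pass over the
# listing gives the total number of bytes the progress bar should cover.
paginator = s3.get_paginator("list_objects_v2")
keys = []
total_bytes = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])
        total_bytes += obj["Size"]

# Second pass: download each object and advance the bar by its size.
with tqdm(total=total_bytes, unit="B", unit_scale=True) as bar:
    for key in keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        # ... parse the JSON and write it to the database here ...
        bar.update(len(body))
```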
An option to download and save the full JSON to local disk would be useful, plus a way to run the command against that cached directory of data. This would support iteration during development without sucking down 2GB of data every time.
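A sketch of what that cache could look like, with assumed names throughout; it skips any key whose size already matches a file on disk, so repeated runs only download what is missing:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

def download_to_cache(bucket, prefix, cache_dir):
    """Mirror every JSON object under prefix into a local directory."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Flatten the key into a filename; a real implementation
            # might preserve the directory structure instead.
            target = cache / obj["Key"].replace("/", "_")
            if target.exists() and target.stat().st_size == obj["Size"]:
                continue  # already cached - no need to fetch again
            s3.download_file(bucket, obj["Key"], str(target))
```

The command could then accept either the bucket or a path to this directory as its data source.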
My local (unpublished) prototype of this feature is at http://localhost:8888/notebooks/Make%20database%20of%20s3-ocr%20data%20for%20sfms-history.ipynb
This will work best if the local SQLite database can work as a cache, to minimize the amount of traffic to S3 and to allow the command to be quit and resumed, and run again to import fresh data. It needs to fetch the result listing for all of the 500 PDFs (500 PDFs = 500 result files). Having fetched those, it can fetch the job results for each file that hasn't been fetched yet, parse them, and use them to populate the searchable text table.
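One way that resumable cache could be structured, sketched with sqlite-utils; the table names, the ETag freshness check, and the shape of the parsed JSON are all assumptions, not the actual implementation:

```python
import json

import sqlite_utils

db = sqlite_utils.Database("ocr.db")

# Records which result files have been imported, keyed on object key and
# ETag, so the command can be quit, resumed, and re-run for fresh data.
db["fetched"].create({"key": str, "etag": str}, pk="key", if_not_exists=True)
db["pages"].create(
    {"key": str, "page": int, "text": str},
    pk=("key", "page"),
    if_not_exists=True,
)
db["pages"].enable_fts(["text"], create_triggers=True, replace=True)

def needs_fetch(key, etag):
    # Fetch only files we have never seen, or whose ETag has changed.
    try:
        return db["fetched"].get(key)["etag"] != etag
    except sqlite_utils.db.NotFoundError:
        return True

def import_result(key, etag, raw_json):
    # Parse one job result and populate the searchable text table;
    # the {"pages": [{"text": ...}]} layout is hypothetical.
    data = json.loads(raw_json)
    db["pages"].upsert_all(
        (
            {"key": key, "page": i, "text": page.get("text", "")}
            for i, page in enumerate(data.get("pages", []))
        ),
        pk=("key", "page"),
    )
    db["fetched"].upsert({"key": key, "etag": etag}, pk="key")
```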
This command will suck down the OCR data from the bucket and use it to build a database.
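A hypothetical invocation of the eventual command; the subcommand name and argument order here are guesses:

```
s3-ocr index my-ocr-bucket ocr.db
```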