Add a live demo #16

Closed
simonw opened this issue Jun 30, 2022 · 11 comments
Labels
documentation Improvements or additions to documentation

Comments

@simonw
Owner

simonw commented Jun 30, 2022

The demo can run in GitHub Actions, against a demo S3 bucket created for the purpose.

I'll deploy the resulting database file using Datasette, at s3-ocr-demo.datasette.io.

@simonw simonw added the documentation Improvements or additions to documentation label Jun 30, 2022
@simonw
Owner Author

simonw commented Jun 30, 2022

I added a new --statement option to s3-credentials to make it easier to create a dedicated access key for this demo.

@simonw
Owner Author

simonw commented Jun 30, 2022

I need some PDF files! The Internet Archive has a bunch of interesting ones that are out of copyright and that include handwriting.

@simonw
Owner Author

simonw commented Jun 30, 2022

I'm going to grab some of the PDFs from https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22 - "Harry Houdini Collection (Library of Congress) DLC"

@simonw
Owner Author

simonw commented Jun 30, 2022

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' -c > ocr.json

I grabbed those PDFs and uploaded them to the bucket like this:

for f in *.pdf; do s3-credentials put-object s3-ocr-demo "$f" "$f" -a ocr.json; done
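
For anyone scripting this step, the extra policy statement can be generated rather than typed by hand. The following is a minimal sketch, not part of s3-credentials itself: textract_statement and create_command are hypothetical helper names, and the command it prints simply mirrors the s3-credentials create call above.

```python
import json
import shlex


def textract_statement():
    # The same statement passed to --statement above: allow all Textract actions.
    return {"Effect": "Allow", "Action": "textract:*", "Resource": "*"}


def create_command(bucket):
    # Build the argument list for the s3-credentials call shown above.
    # The -c flag is included exactly as in the original command.
    return [
        "s3-credentials", "create", bucket,
        "--statement", json.dumps(textract_statement()),
        "-c",
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(create_command("s3-ocr-demo")) + " > ocr.json")
```

Building the JSON with json.dumps and quoting it with shlex.join avoids hand-escaping mistakes if the statement grows beyond a single action.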

@simonw
Owner Author

simonw commented Jun 30, 2022

I started OCR like this:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9

@simonw
Owner Author

simonw commented Jun 30, 2022

I ran this to keep an eye on how it was going (I should upgrade the s3-ocr status command for this):

% s3-credentials list-bucket s3-ocr-demo | jq '.[].Key'
"latestmagicbeing00hoff.pdf"
"latestmagicbeing00hoff.pdf.s3-ocr.json"
"practicalmagicia00harr.pdf"
"practicalmagicia00harr.pdf.s3-ocr.json"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/.s3_access_check"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/1"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/2"
"textract-output/ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55/.s3_access_check"
"textract-output/f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402/.s3_access_check"
"unmaskingrobert00houdgoog.pdf"
"unmaskingrobert00houdgoog.pdf.s3-ocr.json"
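
The same check can be done from the list of keys in a few lines of Python. This is a rough sketch, not something s3-ocr provides: summarize is a hypothetical helper, and it assumes (based only on the key names visible in the listing above) that each job writes numbered files under textract-output/<job id>/. It cannot tell whether a job has finished, only how much output has appeared so far.

```python
from collections import Counter

MARKER_SUFFIX = ".s3-ocr.json"
OUTPUT_PREFIX = "textract-output/"


def summarize(keys):
    """Group bucket keys into PDFs, PDFs with a marker file, and output counts."""
    pdfs = sorted(k for k in keys if k.endswith(".pdf"))
    # A PDF counts as started once its .s3-ocr.json marker file exists.
    started = sorted(
        k[: -len(MARKER_SUFFIX)] for k in keys if k.endswith(MARKER_SUFFIX)
    )
    outputs = Counter()
    for key in keys:
        if not key.startswith(OUTPUT_PREFIX):
            continue
        job_id, _, name = key[len(OUTPUT_PREFIX):].partition("/")
        # Count only the numbered files; skip .s3_access_check.
        if name.isdigit():
            outputs[job_id] += 1
    return {"pdfs": pdfs, "started": started, "outputs": dict(outputs)}
```

To use it with the command above, load the JSON printed by s3-credentials list-bucket and pass the Key value of each item to summarize.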

@simonw
Owner Author

simonw commented Jun 30, 2022

After a while it was done:

% s3-credentials list-bucket s3-ocr-demo -a ocr.json | jq '.[].Key' | grep 'textract' | wc -l
     207

Then I ran this:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
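
Once the index command finishes, pages.db can be queried directly as well as through Datasette. A minimal sketch follows, under a loudly stated assumption: it expects a pages table with path, page and text columns, which is not shown anywhere in this thread, so check the real schema first. search_pages is a hypothetical helper name.

```python
import sqlite3


def search_pages(conn, term, limit=5):
    """Return (path, page, snippet) rows whose text contains term."""
    # ASSUMPTION: a "pages" table with path, page and text columns.
    # Note that % and _ inside term act as LIKE wildcards.
    sql = (
        "select path, page, substr(text, 1, 80) from pages "
        "where text like ? order by path, page limit ?"
    )
    return conn.execute(sql, ("%" + term + "%", limit)).fetchall()
```

Passing sqlite3.connect("pages.db") as conn and a search word as term would list the first few matching pages with an 80-character snippet of each.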

@simonw
Owner Author

simonw commented Jun 30, 2022

I'm not going to bother with GitHub Actions for this - I'm going to generate and deploy the demo from my laptop. I may automate this with GitHub Actions at a later date.

@simonw
Owner Author

simonw commented Jun 30, 2022

Deployed to Vercel:

datasette publish vercel pages.db \
  --project s3-ocr-demo \
  --about 's3-ocr demo' \
  --about_url 'https://datasette.io/tools/s3-ocr' \
  --source 'Library of Congress' \
  --source_url 'https://github.com/simonw/s3-ocr/issues/16' \
  --scope datasette

I added a custom domain to it too. Since datasette.io belongs to my simonw account and not the datasette scope, I had to add a _vercel TXT record to the domain to get that to work.

It is now live at https://s3-ocr-demo.datasette.io/

@simonw
Owner Author

simonw commented Jun 30, 2022

@simonw simonw closed this as completed in 0cc244d Jun 30, 2022
simonw added a commit that referenced this issue Jun 30, 2022