Add a live demo #16

simonw · 2022-06-30T20:04:32Z

The demo can run in GitHub Actions, against a demo S3 bucket created for the purpose.

I'll deploy the resulting database file using Datasette, at s3-ocr-demo.datasette.io.

simonw · 2022-06-30T20:05:04Z

I added a new --statement option to s3-credentials to make it easier to create a dedicated access key for the purpose of this demo:

Make it easier to add extra policy statements s3-credentials#72

simonw · 2022-06-30T20:05:53Z

I need some PDF files! The Internet Archive has a bunch of interesting ones that are out of copyright and that include handwriting.

simonw · 2022-06-30T20:13:34Z

I'm going to grab some of the PDFs from https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22 - "Harry Houdini Collection (Library of Congress) DLC"

simonw · 2022-06-30T20:16:21Z

simonw · 2022-06-30T20:33:29Z

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' -c > ocr.json

I grabbed those PDFs and uploaded them to the bucket like this:

for f in *.pdf; do s3-credentials put-object s3-ocr-demo "$f" "$f" -a ocr.json; done

simonw · 2022-06-30T20:34:08Z

I started OCR like this:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
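
The job IDs printed here are the same IDs that show up later as textract-output/<job ID>/ prefixes in the bucket, so it can help to keep a mapping from ID to PDF. A minimal sketch, assuming the output keeps the "Starting OCR for <file>, Job ID: <id>" format shown above (job_map is a hypothetical helper name, not part of s3-ocr):

```shell
# Turn "Starting OCR for <file>, Job ID: <id>" lines into "<id> <file>" pairs.
# Lines that do not match this format (such as the "Found ..." summary) are dropped.
job_map() {
  sed -n 's/^Starting OCR for \(.*\), Job ID: \([0-9a-f]*\)$/\2 \1/p'
}
```

One way to use it would be to save the start output with tee start.log and then run job_map < start.log.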

simonw · 2022-06-30T20:36:22Z

I ran this to keep an eye on how it was going (I should upgrade the status command for this):

% s3-credentials list-bucket s3-ocr-demo | jq '.[].Key'
"latestmagicbeing00hoff.pdf"
"latestmagicbeing00hoff.pdf.s3-ocr.json"
"practicalmagicia00harr.pdf"
"practicalmagicia00harr.pdf.s3-ocr.json"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/.s3_access_check"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/1"
"textract-output/93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9/2"
"textract-output/ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55/.s3_access_check"
"textract-output/f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402/.s3_access_check"
"unmaskingrobert00houdgoog.pdf"
"unmaskingrobert00houdgoog.pdf.s3-ocr.json"

simonw · 2022-06-30T20:38:51Z

After a while it was done:

% s3-credentials list-bucket s3-ocr-demo -a ocr.json | jq '.[].Key' | grep 'textract' | wc -l
     207
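
The raw count does not show which job the result files belong to. A rough sketch of a per-job summary, assuming results are stored as numbered files under textract-output/<job ID>/ as in the earlier listing (ocr_progress is a hypothetical helper name, not part of s3-ocr or s3-credentials):

```shell
# Summarise OCR progress from a list of bucket keys, one per line on stdin.
# Prints "<job id> <number of result files>" for each job, sorted by job ID.
ocr_progress() {
  awk -F/ '
    # Only numbered result files count; this skips .s3_access_check.
    $1 == "textract-output" && $3 ~ /^[0-9]+$/ { parts[$2]++ }
    # Remember every job ID, including jobs with no results yet.
    $1 == "textract-output" { jobs[$2] = 1 }
    END {
      for (job in jobs) printf "%s %d\n", job, parts[job]
    }
  ' | sort
}
```

It could be fed with s3-credentials list-bucket s3-ocr-demo -a ocr.json | jq -r '.[].Key' | ocr_progress, where -r makes jq emit the keys without surrounding quotes.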

Then I ran this:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34

simonw · 2022-06-30T20:39:27Z

I'm not going to bother with GitHub Actions for this - I'm going to generate and deploy the demo from my laptop. I may automate this with GitHub Actions at a later date.

simonw · 2022-06-30T20:45:13Z

Deployed to Vercel:

datasette publish vercel pages.db \
  --project s3-ocr-demo \
  --about 's3-ocr demo' \
  --about_url 'https://datasette.io/tools/s3-ocr' \
  --source 'Library of Congress' \
  --source_url 'https://github.com/simonw/s3-ocr/issues/16' \
  --scope datasette

I added a custom domain to it too. Since datasette.io belongs to my simonw account and not the datasette scope, I had to add a _vercel TXT record to the domain to get that to work.

It is now live at https://s3-ocr-demo.datasette.io/

simonw · 2022-06-30T20:45:40Z

Demo: https://s3-ocr-demo.datasette.io/pages/pages?_search=harry

Refs #15, #16

simonw added the documentation label Jun 30, 2022

simonw closed this as completed in 0cc244d Jun 30, 2022

simonw added a commit that referenced this issue Jun 30, 2022

Release 0.4

46712e9

Refs #15, #16
