Sample Java code to OCR input files (with Amazon Textract) and upload the resulting PDFs to tagtog 🤘🚀.

tagtog/java-ocr-amazon-textract-searchable-pdf

This is a fully functioning sample repository showing:

  1. how to use an external OCR provider (in this case Amazon Textract);
  2. how to upload the resulting PDFs into tagtog.

The code is written in Java (11).

This code builds on an Amazon Textract tutorial (original code) that OCRs input files (PDFs or images) and converts them into "searchable PDFs" (i.e. PDFs with embedded text). These "searchable PDFs" are exactly what we want to upload to tagtog and then annotate using tagtog Native PDF.

This repository adds further utilities (e.g. recursively traversing and processing the given directories) and uses the tagtog Documents API to upload the results to a given tagtog project. HTTP requests are made in Java with Apache HttpClient (4.5).
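As an illustration of the recursive traversal mentioned above, here is a minimal sketch using only the JDK. The class name `InputCollectorSketch` and the list of accepted extensions are assumptions made for this example; they are not taken from the repository, whose actual implementation may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not a class from this repository. */
final class InputCollectorSketch {

    // Assumed set of extensions; adjust it to what your OCR provider accepts.
    private static final Set<String> EXTENSIONS = Set.of("pdf", "png", "jpg", "jpeg");

    /** True if the file name ends with one of the assumed extensions (case-insensitive). */
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        return dot >= 0 && EXTENSIONS.contains(name.substring(dot + 1));
    }

    /**
     * Expands each argument (a file or a directory) into the supported regular
     * files found under it. Results are sorted within each argument.
     */
    static List<Path> collect(List<Path> inputs) {
        List<Path> result = new ArrayList<>();
        for (Path input : inputs) {
            // Files.walk also works for a plain file: it then yields just that file.
            try (Stream<Path> walk = Files.walk(input)) {
                walk.filter(Files::isRegularFile)
                    .filter(InputCollectorSketch::isSupported)
                    .sorted()
                    .forEach(result::add);
            } catch (IOException e) {
                throw new UncheckedIOException("Cannot read " + input, e);
            }
        }
        return result;
    }
}
```

Each returned path could then be handed to the OCR step and, afterwards, to the upload step.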

The main entry point is DemoTagtogOcr.java. The code has 3 main ingredients:

  1. Calling the Amazon Textract API
  2. Translating the JSON output from Amazon Textract into a "searchable PDF" (with Java PDFBox)
  3. Calling the tagtog API to upload documents
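Step 2 largely comes down to coordinate conversion. Amazon Textract reports each bounding box as ratios of the page size (Left, Top, Width, Height) measured from the top-left corner, whereas PDF user space by default has its origin at the bottom-left corner with the y axis pointing up. The following is a minimal sketch of that conversion; the class and method names are hypothetical and this is not the repository's code.

```java
/** Hypothetical helper for illustration; not a class from this repository. */
final class BoundingBoxSketch {

    /**
     * Converts a Textract-style bounding box (ratios of the page size, origin
     * at the top-left) into PDF units (origin at the bottom-left).
     *
     * @return {x, y, width, height}, where (x, y) is the lower-left corner of the box
     */
    static double[] toPdfRect(double left, double top, double width, double height,
                              double pageWidth, double pageHeight) {
        double x = left * pageWidth;
        double w = width * pageWidth;
        double h = height * pageHeight;
        // Flip the vertical axis: Textract measures "top" downwards from the
        // top edge, while PDF measures y upwards from the bottom edge.
        double y = pageHeight - top * pageHeight - h;
        return new double[] {x, y, w, h};
    }
}
```

For example, a box with Left 0.25, Top 0.5, Width 0.5 and Height 0.25 on a 600 x 800 page maps to x = 150, y = 200, width = 300, height = 200. The embedded text would be drawn relative to that rectangle; how far above its bottom edge the text baseline sits is a separate tuning choice.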

🧱 Compile

git clone https://github.com/tagtog/java-ocr-amazon-textract-searchable-pdf.git
cd java-ocr-amazon-textract-searchable-pdf/src/SearchablePDF/

./compile.sh

⚡️ Run

# Set your tagtog credentials
export TAGTOG_USERNAME=???
export TAGTOG_PASSWORD=???
# export TAGTOG_DOMAIN=??? # optionally, override the tagtog domain, for example if you are running tagtog OnPremises

time ./run.sh MY_TAGTOG_OWNERNAME MY_TAGTOG_PROJECT MY_TAGTOG_FOLDER ...inputFilesOrDirectories
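To give an idea of how these settings might feed into an upload request, here is a sketch that only builds the request URI and the Basic authentication header, using nothing but the JDK (the repository itself sends requests with Apache HttpClient). Several details are assumptions to verify against the tagtog API documentation: the default domain, the `/-api/documents/v1` path, the query parameter names, and the expectation that `TAGTOG_DOMAIN` includes the scheme. The class `TagtogRequestSketch` is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not a class from this repository. */
final class TagtogRequestSketch {

    // Assumption: default domain used when TAGTOG_DOMAIN is not set.
    static final String DEFAULT_DOMAIN = "https://www.tagtog.com";

    /** Returns the override if present and non-blank, otherwise the assumed default. */
    static String domainOrDefault(String envValue) {
        return (envValue == null || envValue.isBlank()) ? DEFAULT_DOMAIN : envValue;
    }

    /** Builds the value of an HTTP "Authorization" header for Basic authentication. */
    static String basicAuthHeader(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    /** Assumed endpoint path and parameter names; check them against the tagtog API docs. */
    static URI documentsUri(String domain, String owner, String project, String folder) {
        return URI.create(domain + "/-api/documents/v1"
                + "?owner=" + encode(owner)
                + "&project=" + encode(project)
                + "&folder=" + encode(folder)
                + "&output=null");
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Arguments mirror run.sh: owner, project, folder.
        String domain = domainOrDefault(System.getenv("TAGTOG_DOMAIN"));
        System.out.println(documentsUri(domain, args[0], args[1], args[2]));
    }
}
```

Keeping the URI and header construction in pure functions makes them easy to check without any network access; the actual multipart file upload would be layered on top with whichever HTTP client you prefer.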

🤓 Set up Amazon Textract

If you are new to AWS or unsure about the details, this is the complete AWS guide to get started with Amazon Textract.

In short, what you need is:

  1. Make sure you have an IAM user with the AmazonTextractFullAccess permissions policy and an access key.
  2. Configure your local AWS credentials so that the [default] profile points to that IAM user, and set your desired region.
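Before running the demo, it can help to confirm that the credentials file actually contains the profile you expect. The sketch below is a hypothetical pre-flight check, not part of this repository: it parses the INI-style `~/.aws/credentials` file and looks for an `aws_access_key_id` entry under `[default]`. It deliberately ignores other ways of supplying credentials (such as environment variables) and does not check the region, which is usually set in `~/.aws/config`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical pre-flight check for illustration; not a class from this repository. */
final class AwsCredentialsCheckSketch {

    /** True if the lines contain a [profile] section defining a non-empty aws_access_key_id. */
    static boolean hasAccessKey(List<String> lines, String profile) {
        boolean inProfile = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("[") && line.endsWith("]")) {
                // A new section header: remember whether it is the profile we want.
                inProfile = line.substring(1, line.length() - 1).trim().equals(profile);
            } else if (inProfile) {
                String[] keyValue = line.split("=", 2);
                if (keyValue.length == 2
                        && keyValue[0].trim().equals("aws_access_key_id")
                        && !keyValue[1].trim().isEmpty()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Default location of the shared credentials file.
        Path file = Path.of(System.getProperty("user.home"), ".aws", "credentials");
        boolean ok = Files.exists(file) && hasAccessKey(Files.readAllLines(file), "default");
        System.out.println(ok ? "Found [default] access key" : "No [default] access key in " + file);
    }
}
```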

πŸƒ Sample tagtog Project

Using this very same code, we OCR'ed the FUNSD dataset and uploaded the results into the tagtog public project: tagtog/FUNSD-OCRed 😃.

These are the exact commands we ran (last updated on 2021-04-20):

time ./run.sh tagtog FUNSD-OCRed testing_data ~/Downloads/dataset/testing_data/  # took ~2m; 50 docs in total
time ./run.sh tagtog FUNSD-OCRed training_data ~/Downloads/dataset/training_data/  # took ~6m; 149 docs in total

These are some sample annotated documents in tagtog.

Notes

The original demo code tends to create oversized PDFs and to place the embedded text slightly below its actual (visual) position. These details can be tweaked and, of course, depend on the OCR software used.
