Uses the AWS Textract APIs DetectDocumentText/AnalyzeExpense/AnalyzeDocument to analyze an image or PDF file and store the analysis results as JSON.
https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeExpense.html https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html
Uses the AWS Rekognition API DetectText to analyze a photo (.PNG or .JPEG) for text in live scenes and store the analysis results as JSON.
https://docs.aws.amazon.com/rekognition/latest/APIReference/API_DetectText.html
Uses the synchronous versions of these APIs, which do not require creating S3 files or monitoring the status of an asynchronous job.
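As an illustration (not this tool's exact code), a minimal synchronous call via boto3 might look like the sketch below. The function name and file path are hypothetical; it assumes AWS credentials are already configured for boto3.

```python
from pathlib import Path


def detect_text(image_path: str) -> dict:
    """Sketch of a synchronous DetectDocumentText call (no S3, no job polling).

    Assumes AWS credentials are configured as for any boto3 project.
    """
    import boto3  # deferred so the sketch can be imported without AWS set up

    textract = boto3.client("textract")
    data = Path(image_path).read_bytes()
    # Synchronous APIs accept the raw document bytes directly (size limits apply).
    return textract.detect_document_text(Document={"Bytes": data})
```

The returned dict is what gets written out verbatim as the `.textract.text.json` file described below.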
Inspired by Simon Willison's s3-ocr. I wanted something similar, but without the S3 overhead (and I could live with the limitation of not supporting multi-page PDFs). Theoretically, this could also be done with aws-cli, but that doesn't support synchronous mode.
Does not support AnalyzeID, nor AnalyzeDocument with FeatureTypes = QUERIES.
My preferred method. Sets up a virtual environment, installs the tool, and adds it to your path, without touching your system Python install.
https://github.com/pypa/pipx
pipx install git+https://github.com/mbafford/textract-cli.git
pip install git+https://github.com/mbafford/textract-cli.git
python3 -m venv .env/
git clone https://github.com/mbafford/textract-cli.git
.env/bin/pip install -e textract-cli/
This uses boto3 for all of the API calls, so configuring the AWS credentials is the same as any other boto3 project. See using boto3 for more specifics.
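For example, credentials can come from the standard shared credentials and config files (an illustrative snippet; the key values and region are placeholders):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = AKIA...EXAMPLE
aws_secret_access_key = ...example-secret...

# ~/.aws/config
[default]
region = us-east-1
```

Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) work as well, as with any boto3 project.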
- Standard image types (JPEG, PNG, TIFF, GIF, etc.) should be supported
  - folder mode only checks for a handful of file extensions
  - single-file mode submits a job for anything you pass, regardless of file type
- Single-page PDFs are supported
- Multi-page PDFs are NOT supported (multi-page processing is only available in asynchronous mode, in which case s3-ocr is likely a better choice)
.env/bin/python textract.py [--text/--expenses/--forms/--tables/--image] <folder with images/image file> [folder with images/image file] [...]
Accepts multiple files or folders as arguments.
For each folder specified, all files with the extensions JPG, JPEG, GIF, PNG, TIFF in that folder will be selected and uploaded.
For each file specified, the script does not check filetype or extension and assumes/trusts the file can be processed by AWS.
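The folder-mode extension filtering can be sketched roughly like this (illustrative, not the script's exact code; the function name is hypothetical):

```python
from pathlib import Path

# Extensions checked in folder mode, per the docs above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".tiff"}


def select_images(folder: Path) -> list[Path]:
    """Pick files in a folder whose extension looks like a supported image."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Files passed directly on the command line skip this filter entirely.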
The script looks for images that don't already have a corresponding output file.
So if the script hangs, crashes, etc., just re-run it; only files that still need to be processed will be processed.
To force a re-process, delete the relevant output files for the image file and re-run the script.
Since each analysis type creates and looks for its own JSON file, this can be run multiple times for each analysis type.
Outputs one or two files per analysis type for each input file:
- [image file].textract.text.json - The exact JSON returned by the AWS API
- [image file].textract.txt - The text content for each of the "LINE" blocks in the above JSON response.
- [image file].textract.expenses.json - The exact JSON returned by the AWS API
- [image file].textract.tables.json - The exact JSON returned by the AWS API
- [image file].textract.forms.json - The exact JSON returned by the AWS API
- [image file].rekognition.json - The exact JSON returned by the AWS API
- [image file].rekognition.txt - The text content for each of the "LINE" detections in the above JSON response.
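The `.txt` files are derived from the JSON roughly like this (a sketch; the key names match the documented Textract response shape, and the function name is hypothetical):

```python
def lines_from_textract(response: dict) -> str:
    """Join the text of every LINE block in a Textract JSON response."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )
```

The Rekognition output is analogous, but its response uses a `TextDetections` list with `Type` and `DetectedText` keys instead of `Blocks`.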