GitHub - seoester/docker-pdfocr: A Docker image that automatically converts image-based PDFs to searchable PDFs.

This project creates a Docker image that automatically turns image-based (scanned) PDFs into searchable PDFs.

The OCR itself is done primarily by the excellent gkovacs/pdfocr Ruby script.

In addition to that script, Ghostscript is used to force the page size to A4. Ideally the page size would be a parameter, but for now it is hard-coded.
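For readers who want to see what forcing the page size can look like, here is a minimal sketch. It is not taken from this repository's scripts: the function name `force_a4` and the `GS` override variable are hypothetical, and the Ghostscript options shown are one common way to rescale pages onto A4 media.

```shell
#!/bin/sh
# Sketch only: rewrite a PDF so that every page is fitted onto A4 media.
# force_a4 and GS are illustrative names, not part of this project.
# GS lets you point at a different Ghostscript binary (defaults to gs).
force_a4() {
    in_pdf=$1
    out_pdf=$2
    # -sPAPERSIZE=a4 selects the media size, -dFIXEDMEDIA stops the input
    # from overriding it, and -dPDFFitPage scales each page to fit.
    "${GS:-gs}" -q -dBATCH -dNOPAUSE \
        -sDEVICE=pdfwrite \
        -sPAPERSIZE=a4 -dFIXEDMEDIA -dPDFFitPage \
        -sOutputFile="$out_pdf" "$in_pdf"
}
```

Calling `force_a4 scan.pdf scan_a4.pdf` would write the A4 version next to the original; turning `a4` into a function argument is the obvious route to the parameterisation mentioned above.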

This container requires you to create two volumes, an inbox and an outbox. Any PDF files dropped in the inbox are converted to searchable PDFs and dropped in the outbox.

Notes:

The default process in the container is incrond. This watches the inbox for new files and immediately triggers the OCR script when new files are added.
A user account matching the UID and GID of the outbox owner is created at container creation time. This account is used to generate the searchable PDF files, so the resulting files should be readable and writable by the owner of the outbox.
After processing, the original PDF is moved to inbox/processed. This presumes that the inbox is also writable by the owner of the outbox. To keep things simple, it is probably best that the same user owns both the inbox and the outbox.
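The ownership lookup and the archiving step described in the notes above can be sketched in a few lines of shell. This is an illustration under assumptions, not the contents of entrypoint.sh or do_ocr: the function names `owner_ids` and `archive_original` are hypothetical, and `stat -c` assumes GNU coreutils as found in most Linux images.

```shell
#!/bin/sh
# Sketch only: helper steps matching the notes above.
# owner_ids and archive_original are illustrative names.

# Print "UID:GID" for the owner of a directory (GNU stat syntax).
owner_ids() {
    stat -c '%u:%g' "$1"
}

# Move a processed original into <inbox>/processed, creating the
# directory first if it does not exist yet.
archive_original() {
    inbox=$1
    pdf=$2
    mkdir -p "$inbox/processed" && mv -- "$pdf" "$inbox/processed/"
}
```

An entrypoint could feed the output of `owner_ids /outbox` into `groupadd -g` and `useradd -u` to create the matching account. If the inbox is owned by someone else, the `mv` in `archive_original` is the step that would fail.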

Example:

$ cd ~
$ mkdir ocr_in ocr_out
$ docker run -d --name pdfocr -v ~/ocr_in:/inbox -v ~/ocr_out:/outbox netservers/docker-pdfocr
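Once the container is running, a quick way to check the round trip is to drop a file into the inbox and poll the outbox. The helper below is a sketch: `wait_for_file` is a hypothetical name, and it assumes the converted PDF keeps the original file name, which this README does not state.

```shell
#!/bin/sh
# Sketch only: poll a directory until a file with the given name appears
# or the timeout (in seconds, default 60) elapses.
# Assumed usage, if the output keeps the input's name:
#   cp scan.pdf ~/ocr_in/ && wait_for_file ~/ocr_out scan.pdf 120
wait_for_file() {
    dir=$1
    name=$2
    timeout=${3:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        [ -f "$dir/$name" ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Final check so a file that appears during the last sleep still counts.
    [ -f "$dir/$name" ]
}
```

If nothing shows up, `docker logs pdfocr` is the first place to look.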
