- Install Docker CE
- Clone this repo:
  git clone https://github.com/simonkeng/pdf_parser.git
- cd into the pdf_parser directory.
- Build the Docker image from the Dockerfile:
  docker build -t pdf_parser .
Run the container and execute the Python script, passing in a document:
docker run -i -t pdf_parser bash -c "python pdf_rip.py test_data.pdf"
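The command above relies on test_data.pdf already being present inside the image. For a PDF that lives on the host, one option is a bind mount. The sketch below is not taken from this repo: the helper name `rip_pdf`, the mount point `/data`, and the `DOCKER` override variable are all hypothetical, and it assumes `pdf_rip.py` accepts an absolute path and that the image defines no entrypoint.

```shell
#!/usr/bin/env bash
# Sketch only: rip_pdf, /data and DOCKER are illustrative names,
# not part of pdf_parser. Assumes pdf_rip.py accepts an absolute path.

DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to preview the command instead of running it

rip_pdf() {
  local pdf="$1" host_dir file
  # docker -v needs an absolute host path, so resolve the PDF's folder first.
  host_dir="$(cd "$(dirname "$pdf")" && pwd)" || return 1
  file="$(basename "$pdf")"
  # Mount the folder at /data and run the script on the mounted file.
  # --rm removes the container once the script exits.
  "$DOCKER" run --rm -v "$host_dir:/data" pdf_parser \
    python pdf_rip.py "/data/$file"
}
```

Once the function is sourced, `rip_pdf ~/Documents/report.pdf` would process that file without rebuilding the image.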
You can also extract from multiple files: place all your PDFs in one folder and copy it into your Docker container.
docker cp pdfs/ 609d09bb400f:/tmp/pdfs/
...replacing 609d09bb400f with your container ID. Now we can run the batch script within a new container.
docker run -i -t pdf_parser bash -c "python batch.py pdf/"
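Started this way, the container prints its output straight to the terminal. If you instead start it detached (-d in place of -i -t), docker run prints a container ID, which you can use to confirm that it ran and to check its output: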
docker logs <containerID>
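One detail of the batch steps is worth keeping in mind: `docker cp` copies files into one specific container, while `docker run` always starts a fresh container from the image, so the new container only sees files that are baked into the image or mounted into it. A minimal sketch of a mount-based alternative follows. It assumes `batch.py` accepts an absolute directory path and that the image defines no entrypoint. The names `run_batch`, `DOCKER`, and the choice of `/tmp/pdfs` as mount target are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run_batch and DOCKER are illustrative names,
# not part of pdf_parser. Assumes batch.py accepts a directory path.

DOCKER="${DOCKER:-docker}"   # overridable so the function can be tried without Docker

run_batch() {
  local host_dir cid
  # Resolve the host folder to an absolute path for the bind mount.
  host_dir="$(cd "$1" && pwd)" || return 1
  # -d detaches and prints the new container's ID, which is captured here.
  cid="$("$DOCKER" run -d -v "$host_dir:/tmp/pdfs" pdf_parser \
    python batch.py /tmp/pdfs/)" || return 1
  "$DOCKER" wait "$cid" >/dev/null   # block until the container exits
  "$DOCKER" logs "$cid"              # show what the script wrote to stdout/stderr
}
```

With the function sourced, `run_batch pdfs/` would mount the local folder, run the batch script, and print its logs. The container is left in place afterwards, so `docker logs` can be run again with the same ID.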