- templates
- invoice
- invoice_date
- jpg2text
- ocr
- ocr_file
- ocr_folder
- pdf2jpg
- pdf2txt
clone project letpdf/bash: bash.letpdf.com
git clone https://github.com/letpdf/bash.git
Prepare python environment
cd /media/tom/projects/letpdf/bash
python3 -m pip install -r requirements.txt
import templates for invoices
git clone https://github.com/letpdf/templates.git
PDFs are extremely difficult to scrape. Converting them to text files can make extracting their data significantly easier. I have focused on the widely used pdfminer package from python.
Convert from PDF files to this:
{ File Name: 2112.pdf Invoice Number: INV002112 Invoice Date: 13-Jun-2016 Due Amount: Rs 1,661.09 }
{ File Name: 2137.pdf Invoice Number: INV002137 Invoice Date: 22-Jun-2016 Due Amount: Rs 45.76 }
{ File Name: 2138.pdf Invoice Number: INV002138 Invoice Date: 22-Jun-2016 Due Amount: Rs 45.76 }
Update the config.json file and set the path as shown bellow:
"src" : "Path where the zip file will be exctracted.",
"des" : "path to the text file in which the complete data of pdf is exctracted.",
"zip" : "complete path to the where the zip file is."
To extract the data from the set of pdf files, the invoive2textdata.py file is executed with the help of command:
python invoive2textdata.py config.json
invoice2textdata.py : Is used to convert the file to text and find File Name, Invoice number, Invoice
date, Due amount.
functions.py : Has set of functions like:
1. convert(),
2. find_invoice_number(),
3. find_date(),
4. find_amount()
config.json : Has path to Source and Destination.