A GUI app built to solve a specific problem: extracting data from a PDF (with the structured, standardized layout of a Serbian vehicle registration certificate smart card) that was created using a smart card reader and the special software for reading smart cards issued by the Ministry of Internal Affairs of the Republic of Serbia. The data is then saved as an Excel file (.xlsx), which can be imported into another Excel template (for creating new documents) or into an Access DB (using a VBA script). The app was exported to an .exe file using PyInstaller.
By: Dušan Miletić
Date created: 24 October 2020
Project uses the following dependencies (which can be installed via pip):
PySimpleGUIQt - for GUI creation
pdfplumber - for PDF parsing
openpyxl - for exporting data to an Excel file (.xlsx)
You can run the app in two ways:
Using Python (if you have Python and dependencies installed)
git clone https://github.com/MDule/parse-pdf-gui.git
pip install -r requirements.txt
python pdf.py OR py pdf.py [for Windows] OR python3 pdf.py [for Linux]
OR
Directly by running the .exe file [Windows]
https://raw.githubusercontent.com/MDule/parse-pdf-gui/main/exe/pdf.exe
SHA-256: 4826135F1FD2AE24335EA6AE1EBAF9335537BB4872F5E421014B3EA747507BAA
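If you download the .exe, you may want to compare its hash with the one listed above before running it. A minimal sketch using only the standard library; the function names are illustrative and not part of this project:

```python
import hashlib

# Checksum published in this README for exe/pdf.exe.
EXPECTED_SHA256 = "4826135F1FD2AE24335EA6AE1EBAF9335537BB4872F5E421014B3EA747507BAA"


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_SHA256):
    """Compare case-insensitively, since the hash above is written in upper case."""
    return sha256_of_file(path).lower() == expected.lower()
```

Calling `matches_checksum("pdf.exe")` should return `True` only for an unmodified download.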
-
When the GUI opens, drag & drop the PDF file into the corresponding field. Drag & drop was needed because it was easier and faster for everyone than using a file open dialog. The other PySimpleGUI ports (based on Tkinter and WxPython) didn't support drag & drop, hence PySimpleGUIQt was used.
-
The script then tried to read the PDF file into a pdfplumber PDF object. It first had to check that the PDF was created properly with a PDF printer (so that text could be extracted), because people sometimes produced a PDF that was really just an image (e.g. a JPG) inside a PDF, which made data extraction impossible.
-
If data could be extracted from the PDF file, the file was read and one large string was created.
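The check and the "one large string" step could look roughly like the sketch below. It is not the code from pdf.py: the function names are made up for illustration, and the only pdfplumber calls used are `pdfplumber.open`, `pdf.pages` and `page.extract_text()`.

```python
def join_page_texts(page_texts):
    """Join per-page text into one string; raise if there is no text layer."""
    # extract_text() can give None or an empty string for image-only pages.
    parts = [text for text in page_texts if text and text.strip()]
    if not parts:
        raise ValueError("No extractable text: the PDF may be a scanned image.")
    return "\n".join(parts)


def read_pdf_text(path):
    """Open a PDF with pdfplumber and return all page text as one string."""
    # Third-party import kept local so join_page_texts works without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Catching the `ValueError` in the GUI code would be one way to tell the user that the PDF has to be recreated with a PDF printer.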
-
The string was then split into rows (the PDFs had structured data: every one had the same structure, just different data).
-
The data needed was then selected from its corresponding row and placed in a variable.
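As a rough illustration of splitting the string into rows and picking values out of them, assuming each row looks like `Label: value`. The labels below are hypothetical placeholders, not the real wording on the certificate, and the real script may select rows differently (for example by position):

```python
# Hypothetical labels mapped to variable names; the real rows differ.
FIELDS = {
    "Registration number": "reg_number",
    "Owner": "owner",
    "Make": "make",
}


def parse_rows(text):
    """Split the extracted text into rows and keep the values we care about."""
    record = {}
    for row in text.splitlines():
        label, separator, value = row.partition(":")
        key = FIELDS.get(label.strip())
        if separator and key:
            record[key] = value.strip()
    return record
```

Rows without a known label are simply skipped, so extra lines in the PDF text do not break the parsing.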
-
Because the values were in Latin script (and some documents or templates needed to be in Cyrillic), they had to be transliterated into Cyrillic using a custom-made module.
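The custom module itself is not reproduced here. The sketch below only shows the general idea of Serbian Latin to Cyrillic transliteration, with the digraphs lj, nj and dž handled before single letters; the names are illustrative. It does not handle words where such a letter pair is not a digraph (e.g. "injekcija"), and it leaves letters outside the Serbian alphabet unchanged.

```python
# Two-letter Latin sequences that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Single-letter mapping for the Serbian Latin alphabet.
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic, keeping letter case."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            # "Lj" and "LJ" both become one upper-case Cyrillic letter.
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, q/w/x/y stay as they are
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```

For example, `to_cyrillic("Beograd")` gives `Београд`.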
-
After transliteration, the data was exported into a new Excel file, saob_data.xlsx.
It was easier to create a new Excel file and then import it into a template Excel document or the DB than to export directly into an Excel template, for two reasons. First, openpyxl was deleting pictures from Excel files at that time, and most of the templates had logos or other images embedded in them. Second, the Python script would otherwise be too tightly coupled to the files and applications used: about 5 separate people, with little to no IT knowledge, worked in the DB and/or created new Excel documents while doing other office work at the same time, so that was a bad idea.
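Writing the standalone file with openpyxl can be sketched as follows. The two-column field/value layout and the function names are assumptions made for illustration; the real saob_data.xlsx layout depends on what the Excel template and the VBA script expect.

```python
def build_rows(record):
    """Turn a dict of extracted values into [field, value] rows with a header."""
    return [["field", "value"]] + [[key, value] for key, value in record.items()]


def save_xlsx(rows, path="saob_data.xlsx"):
    """Write the rows to a new workbook, one list per spreadsheet row."""
    # Third-party import kept local so build_rows works without openpyxl.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(path)
```

Because the file is always created from scratch, no existing workbook (and none of its images) is ever opened or modified by the script.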
The GUI main loop has a weird if statement for running because the app was first placed on the server PC (a file server), and all users would connect to it and use it from there. It was easier to handle bug fixes and improvements with the app in one place than on every PC. Later, the possibility to run it on every PC was added, on request.
The sample PDF provided for testing purposes doesn't represent a real vehicle registration ID. The data has been changed to protect sensitive info and swapped with dummy data. Any similarity is entirely coincidental.