A script that extracts text data from a PDF file, converts it to a pandas DataFrame, and saves it to a CSV file.
Clone the repository:
git clone https://github.com/serhatci/data-extraction-from-pdf.git
Install the requirements:
pip install -r requirements.txt
Make sure the following PDF files are in the script folder:
ITRCAnnualReportPdf2019.pdf
ITRCAnnualReportPdf2018.pdf
Then run the application:
python script/pdf_data_extractor.py
The script requires Python 3.7 or higher.
The following libraries should be installed:
pip install pdfplumber~=0.5.25
pip install pandas~=0.25.1
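With these libraries installed, the extract, convert, and save steps can be sketched roughly as follows. This is a minimal illustration, not the code in `pdf_data_extractor.py`: the function names `text_to_rows` and `pdf_to_csv` and the `page`/`line` columns are hypothetical, and the real script presumably parses the report lines into more specific fields.

```python
def text_to_rows(page_number, text):
    """Split one page's text into records, skipping blank lines.

    The "page" and "line" keys are illustrative column names only.
    """
    return [
        {"page": page_number, "line": line.strip()}
        for line in (text or "").splitlines()  # extract_text() may return None
        if line.strip()
    ]


def pdf_to_csv(pdf_path, csv_path):
    """Read every page of a PDF and write its text lines to a CSV file."""
    # Imported here so text_to_rows can be used without the PDF dependencies.
    import pandas as pd
    import pdfplumber

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            rows.extend(text_to_rows(number, page.extract_text()))

    df = pd.DataFrame(rows, columns=["page", "line"])
    df.to_csv(csv_path, index=False)  # index=False omits the row-number column
    return df
```

Calling `pdf_to_csv("ITRCAnnualReportPdf2019.pdf", "output.csv")` would write one CSV row per non-empty text line; `output.csv` is a placeholder file name.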
The image below shows the format of the PDF file and the extracted data in the CSV file.