serhatci/data-extraction-from-pdf

Data extraction from a pdf file

A script that extracts text data from a PDF file, converts it to a pandas DataFrame, and saves it to a CSV file.
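The three steps described above (extract text, build a DataFrame, write a CSV) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the output file name, the column names, and the assumption that table columns are separated by two or more spaces are all hypothetical and would need to be adapted to the real layout of the PDF.

```python
import re


def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    # Imported here so parse_rows() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def parse_rows(text, n_columns):
    """Split each line into columns and keep only lines with n_columns fields.

    Assumption (hypothetical): columns are separated by 2+ spaces.
    """
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == n_columns:
            rows.append(fields)
    return rows


def save_as_csv(rows, columns, csv_path):
    """Build a pandas DataFrame from the rows and write it to a CSV file."""
    import pandas as pd

    df = pd.DataFrame(rows, columns=columns)
    df.to_csv(csv_path, index=False)
    return df


# Example usage (column names and output path are placeholders):
# text = extract_text("ITRCAnnualReportPdf2019.pdf")
# rows = parse_rows(text, n_columns=3)
# save_as_csv(rows, ["company", "state", "date"], "extracted.csv")
```

Keeping the line-parsing step separate from the PDF and CSV steps makes it easy to adjust the splitting rule once the actual text layout of a report is known.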

Installation

Clone the repository:
git clone https://github.com/serhatci/data-extraction-from-pdf.git

Install the requirements:
pip install -r requirements.txt

Make sure the following PDF files are in the script folder:
ITRCAnnualReportPdf2019.pdf
ITRCAnnualReportPdf2018.pdf

Then run the script:
python script/pdf_data_extractor.py
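Before running the script, it can help to confirm that both PDF files are where the script expects them. The following is a small optional sketch, not part of the repository: the helper name `missing_pdfs` is made up for illustration, and the folder name `script` is an assumption inferred from the run command above.

```python
from pathlib import Path

# File names taken from the installation notes above.
REQUIRED_PDFS = (
    "ITRCAnnualReportPdf2019.pdf",
    "ITRCAnnualReportPdf2018.pdf",
)


def missing_pdfs(folder, names=REQUIRED_PDFS):
    """Return the required file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Example usage ("script" is an assumed folder name):
# missing = missing_pdfs("script")
# if missing:
#     print("Missing PDF files:", ", ".join(missing))
```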

Requirements

The script requires Python 3.7 or higher.

The following libraries should be installed:

pip install pdfplumber~=0.5.25
pip install pandas~=0.25.1

Demonstration of text extracted from a PDF file

The image below shows the format of the PDF file and the extracted data in the CSV file.

[Image: format of the PDF file alongside the extracted data in the CSV file]
