Commit

new optimized version for data extraction API
ahmedkhemiri95 committed Feb 21, 2023
1 parent 2074a4b commit 72f787f
Showing 46 changed files with 335 additions and 285 deletions.
3 changes: 0 additions & 3 deletions .gitignore

This file was deleted.

76 changes: 0 additions & 76 deletions CODE_OF_CONDUCT.md

This file was deleted.

6 changes: 6 additions & 0 deletions Dockerfile
@@ -0,0 +1,6 @@
FROM python:3.7
COPY . /app/
WORKDIR /app
RUN pip install -r requirements.txt
ENTRYPOINT ["python3"]
CMD ["app.py"]
21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

53 changes: 0 additions & 53 deletions README.md
@@ -1,53 +0,0 @@
# PDFs-TextExtract
Python Multiple and Large PDF Documents Text Extraction - Python 3.7
![Logo](XPDF.jpg)



## Introduction
**As a data scientist, you may not be able to stick to a single data format.**

PDFs are a good source of data; many organizations release their data only as PDFs. **As AI grows, we need more data for prediction and classification**; hence, ignoring PDFs as a data source could be a blunder.

*PDF processing falls under text analytics.*


Most text analytics libraries and frameworks are written in Python, which gives it an advantage for text analytics. However, existing machine learning and natural language processing frameworks cannot process a PDF directly. Unless they provide an explicit interface for this, **we have to convert the PDF to text first.**
## Problem
Most Python libraries for PDF processing, such as PyPDF2 and pdfminer.six, perform well at text extraction, but only on small, simple PDF documents.

That's why the **PDFs-TextExtract** project was developed: to **extract text from multiple and large PDF documents.**

## Setup Environment

#### For use with MacOS X, the scripts will need to be modified to remove "/PDFs-TextExtract" from the path.

- **Step 1:** Select Version of Python (Python 3.7) to Install from [Python.org](https://www.python.org/) website.
- **Step 2:** Download Python Executable Installer.
- **Step 3:** Run Executable Installer.
- **Step 4:** Verify Python Was Installed On Windows.
- **Step 5:** Verify Pip Was Installed.
- **Step 6:** Add Python Path to Environment Variables (Optional).
- **Step 7:** Install Python extension for your IDE (Visual Studio Code).
- **Step 8:** Now you’ll be able to execute python scripts with your IDE (Visual Studio Code).
- **Step 9:**

## Install dependencies

pip install -r requirements.txt

## Usage
- **Step 1:** Open **..\PDFs-TextExtract-master\samples** folder and put your PDF Documents inside.
- **Step 2:** Execute **..\PDFs-TextExtract-master\Scripts\merged.py** script.
- **Step 3:** Execute **..\PDFs-TextExtract-master\Scripts\spliter.py** script.
- **Step 4:** Execute **..\PDFs-TextExtract-master\Scripts\extract_text.py** script.
- **Step 5:** Open **..\PDFs-TextExtract-master\output** and you will find the result there.
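
The steps above can also be chained from a single Python driver. The following is a minimal sketch, not part of the project: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the three scripts live in a `Scripts` folder under the repository root and can be run with that root as the working directory.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the Usage steps: merge, then split, then extract.
PIPELINE = ("merged.py", "spliter.py", "extract_text.py")


def build_commands(repo_root, python=sys.executable):
    """Return one command (as an argument list) per pipeline script, in order."""
    scripts_dir = Path(repo_root) / "Scripts"
    return [[python, str(scripts_dir / name)] for name in PIPELINE]


def run_pipeline(repo_root):
    """Run each script in turn and stop at the first one that fails."""
    for command in build_commands(repo_root):
        # check=True raises CalledProcessError on a non-zero exit status.
        # Assumption: the scripts resolve their paths relative to repo_root.
        subprocess.run(command, check=True, cwd=repo_root)
```

Calling `run_pipeline("path/to/PDFs-TextExtract-master")` would then execute the merge, split, and extract scripts in sequence. Stopping at the first failure avoids extracting text from a partially merged or split set of files.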

## With bash script
Execute
sh main.sh

## Resources
- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f)
- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project.
- **merger** and **spliter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project.
63 changes: 0 additions & 63 deletions Scripts/extract_text.py

This file was deleted.

15 changes: 0 additions & 15 deletions Scripts/merged.py

This file was deleted.

39 changes: 0 additions & 39 deletions Scripts/spliter.py

This file was deleted.

Binary file added __pycache__/extraction.cpython-311.pyc
Binary file not shown.
Binary file added __pycache__/global_common.cpython-311.pyc
Binary file not shown.
Binary file added __pycache__/splitting.cpython-311.pyc
Binary file not shown.