This project demonstrates how to convert PDF files into images and preprocess them using OpenCV to optimize for Optical Character Recognition (OCR). The preprocessing steps include grayscale conversion, noise removal, Gaussian blurring, and binarization to improve OCR accuracy.
- Convert PDFs to Images: Uses `pdf2image` to extract PDF pages as JPEG images (see the sketch after this list).
- Grayscale Conversion: Simplifies the image for further processing.
- Noise Removal: Applies dilation and erosion to clean up the image.
- Gaussian Blur: Reduces noise by smoothing the image.
- Binarization: Converts the image to black-and-white for OCR using Otsu's threshold.
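A minimal sketch of the PDF-to-image step, assuming PDFs are read from `pdfs/` and pages are written to `images/`; the DPI value and output filenames are illustrative and not necessarily those used in `pre_processing.py`:

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler-utils

pdf_dir = Path("pdfs")
image_dir = Path("images")
image_dir.mkdir(exist_ok=True)

for pdf_path in pdf_dir.glob("*.pdf"):
    # Render each page of the PDF as a PIL image
    pages = convert_from_path(str(pdf_path), dpi=300)
    for i, page in enumerate(pages, start=1):
        # Illustrative naming scheme: <pdf name>_page_<n>.jpg
        page.save(image_dir / f"{pdf_path.stem}_page_{i}.jpg", "JPEG")
```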
```
project/
├── pdfs/              # Place your PDF files here
├── images/            # Extracted images will be saved here
├── pre_processing.py  # Main Python script
└── README.md          # This README file
```
Make sure you have the following system packages and Python libraries installed:

```bash
sudo apt-get update && sudo apt-get install -y poppler-utils tesseract-ocr
pip install opencv-python pdf2image pillow pytesseract
```
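As an optional, illustrative sanity check (not part of the project itself), you can confirm that the Python libraries import and that the Tesseract binary installed above is reachable:

```python
import cv2
import pytesseract
from pdf2image import convert_from_path  # needs poppler-utils at runtime

print("OpenCV version:", cv2.__version__)
# Raises an error if the tesseract-ocr binary is not on the PATH
print("Tesseract version:", pytesseract.get_tesseract_version())
```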
- Clone this repository:

  ```bash
  git clone <your-repo-url>
  cd project
  ```

- Place your PDFs in the `pdfs/` directory.
- Run the script:

  ```bash
  python pre_processing.py
  ```

- Check the `images/` directory for the extracted and processed images.
- Grayscale Conversion: Reduces the image to a single color channel for easy processing.
- Dilation & Erosion: Cleans up noise and connects broken parts of objects.
- Gaussian Blur: Smooths out small variations in the image.
- Binarization: Converts the image to black-and-white for better OCR performance (see the sketch below).
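The following is a minimal sketch of these steps using standard OpenCV calls; the kernel size, iteration counts, and blur parameters are illustrative defaults, not necessarily the values used in `pre_processing.py`:

```python
import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
    image = cv2.imread(image_path)

    # Grayscale conversion: collapse to a single channel
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Noise removal: dilation followed by erosion with a small kernel
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.dilate(gray, kernel, iterations=1)
    cleaned = cv2.erode(cleaned, kernel, iterations=1)

    # Gaussian blur: smooth out small variations before thresholding
    blurred = cv2.GaussianBlur(cleaned, (5, 5), 0)

    # Binarization: Otsu's method picks the threshold automatically
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```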
After running the script, you should see the processed images saved in the `images/` directory.
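From there, the processed images can be fed to Tesseract via `pytesseract` (installed above); the filename below is hypothetical, so substitute any file produced in `images/`:

```python
import cv2
import pytesseract

# Load one of the processed, binarized images (hypothetical filename)
processed = cv2.imread("images/example_page_1.jpg", cv2.IMREAD_GRAYSCALE)

# Run OCR on the preprocessed image
text = pytesseract.image_to_string(processed)
print(text)
```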
This project is licensed under the MIT License. See the `LICENSE` file for more details.