Optical Character Recognition (OCR) for Image Files

This Python script uses Optical Character Recognition (OCR) to extract text from image files. The images are preprocessed to enhance the text's visibility and are then processed with Tesseract, an OCR engine. This program supports multiple image formats, including PNG, JPG, JPEG, BMP, and TIFF.

Features

Image preprocessing to enhance text visibility
Parallel processing for efficiency
Skipped row tracking for data verification
Extracted data output to an Excel file
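
As a rough illustration of the parallel-processing feature, the sketch below fans a list of image paths out to a thread pool using concurrent.futures. The worker ocr_one_image is a hypothetical stand-in for the per-image work (preprocess, then OCR), and the names process_in_parallel and max_workers are assumptions, not taken from the script, which may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ocr_one_image(path):
    """Hypothetical stand-in for the real per-image work (preprocess + OCR)."""
    return f"text from {path}"


def process_in_parallel(paths, max_workers=4):
    """Run ocr_one_image on every path and return {path: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(ocr_one_image, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad image should not stop the rest of the batch.
                results[path] = None
                print(f"Failed on {path}: {exc}")
    return results
```

Threads are a reasonable choice here because pytesseract runs the Tesseract binary as a separate process, so most of the time is spent waiting outside the Python interpreter. A process pool would be the alternative if the preprocessing step turned out to dominate.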

Prerequisites

To run this script, you need Python 3.6+ and the following libraries:

OpenCV
Tesseract (the OCR engine itself, installed separately; not a Python package)
pytesseract
pandas
numpy
concurrent.futures (built-in)
shutil (built-in)
os (built-in)
warnings (built-in)

You also need to have Tesseract OCR installed on your machine. pytesseract is only a Python wrapper around the Tesseract executable and does not include the engine.
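
A quick way to confirm that the engine is reachable before running the script is to look for the executable on PATH. This is a minimal sketch; the function name find_tesseract is made up for illustration, and it assumes the executable is called tesseract.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if absent."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it or point pytesseract at the binary.")
    else:
        print(f"Tesseract found at {path}")
```

If the executable lives outside PATH, pytesseract can be told where it is by setting pytesseract.pytesseract.tesseract_cmd to the full path.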

Usage

Clone this repository:

git clone https://github.com/NripeshN/picture-to-df.git
cd picture-to-df

Run the script:

python pic_to_df.py

Before running, make sure the directory path and the other settings in the config dictionary, defined in the if __name__ == '__main__' block, are correct. The configuration dictionary includes settings for image dilation, blur, the path for saving preprocessed images, Tesseract commands, and the directory for storing processed files.

The script scans the specified directory for image files, preprocesses them, and then uses Tesseract to extract text from the images. The extracted text is then converted into a DataFrame and written to an Excel file. Any rows of text that the script cannot process are noted in a 'skipped' DataFrame, which is also written to the Excel file. The images are moved to a processed files directory once the text has been extracted.
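
The scanning and moving steps described above can be sketched with the standard library alone. The helper names find_images and mark_processed are assumptions for illustration; the extension set mirrors the formats listed in the introduction.

```python
import shutil
from pathlib import Path

# Formats the README lists as supported.
SUPPORTED = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def find_images(directory):
    """Return supported image files in directory, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def mark_processed(image_path, processed_dir):
    """Move an image into processed_dir, creating the directory if needed."""
    target_dir = Path(processed_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(image_path).name
    shutil.move(str(image_path), str(target))
    return target
```

Moving a file only after its text has been extracted means a rerun picks up just the images that have not been handled yet.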

The script outputs an Excel file (output.xlsx by default) containing the extracted data and any skipped rows.
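
One way to produce a workbook with both the extracted data and the skipped rows is sketched below. The row format is an assumption: each useful line is taken to hold a fixed number of whitespace-separated values, and anything else is recorded as skipped. The function names, column names, and sheet names are all hypothetical, and the real script's parsing rules may differ.

```python
import pandas as pd


def split_rows(ocr_text, expected_fields=2):
    """Split OCR text into a frame of parsed rows and a frame of skipped lines.

    Assumption: a valid line has exactly `expected_fields` whitespace-separated
    values. Blank lines are ignored; other lines are kept as skipped.
    """
    rows, skipped = [], []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines entirely
        parts = line.split()
        if len(parts) == expected_fields:
            rows.append(parts)
        else:
            skipped.append({"raw_text": line})
    columns = [f"field_{i + 1}" for i in range(expected_fields)]
    return (
        pd.DataFrame(rows, columns=columns),
        pd.DataFrame(skipped, columns=["raw_text"]),
    )


def write_output(data, skipped, path="output.xlsx"):
    """Write both frames to one workbook, one sheet each (requires openpyxl)."""
    with pd.ExcelWriter(path) as writer:
        data.to_excel(writer, sheet_name="data", index=False)
        skipped.to_excel(writer, sheet_name="skipped", index=False)
```

Keeping the skipped lines in their own sheet makes it easy to compare them against the source images and correct them by hand.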

Support

If you encounter any issues, please open an issue in this repository.

Contributing

Pull requests are welcome.

License

This project is licensed under the terms of the MIT license. See LICENSE for more details.
