MetaScan 🖼️: Image Scanning 📸 and Metadata Management 📚

Project Overview 🎯

MetaScan is a prototype for managing large archives of scanned documents. It addresses the usual problems of document organisation and retrieval by building a metadata archive alongside the scans. Beyond digitising documents, MetaScan identifies, extracts, and archives key information from each scanned document, so users can find documents by their content rather than only by their titles. The aim is to make archiving and retrieval more efficient in environments where document scanning is routine.

This project was created for my talk at the upcoming Ubuntu Summit 2023, alongside my ongoing work on PAPPL. Here is the poster for the talk:

The talk schedule will be announced here soon.

Workflow ⚙️

Document Scanning: The user starts by scanning a document through MetaScan's built-in scanner interface, which is powered by the Python SANE library. If you do not have a scanner, you can still test the application with the test scanning process, which picks a random image from the testImages folder. Scans currently default to 300 dpi.
OCR Processing: Once the document is scanned, the Tesseract OCR engine converts the image into machine-readable text. This step extracts the text present in the image so that it can be used for metadata analysis.
Metadata Extraction: The converted text is then analysed using natural language processing (NLP) techniques. Stop word removal, lemmatisation, and named entity recognition enable MetaScan to extract valuable metadata from the document.
Archiving: After extraction, the metadata is stored together with the scanned image in an SQLite database, a compact database that allows efficient storage and quick retrieval of documents.
Data Access via Backend Server: The Flask-based backend server exposes the data in the SQLite database, allowing the frontend to interact with the backend.
User Interaction via Frontend: Users interact with MetaScan through a simple, user-friendly frontend built with React, where they can view, navigate, and manage their digitised documents.
Document Retrieval: To find a specific document, users use the search tool, which matches queries against the metadata. Users can therefore search by document content rather than relying solely on file names.
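
The archiving and retrieval steps above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not MetaScan's actual schema: the table name `documents`, its columns, and the helpers `open_archive`, `archive_document`, and `search_documents` are hypothetical, and the metadata is stored as one space-separated keyword string for simplicity.

```python
import sqlite3


def open_archive(db_path=":memory:"):
    """Open (or create) the archive database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            image_path TEXT NOT NULL,
            ocr_text TEXT NOT NULL,
            keywords TEXT NOT NULL
        )
        """
    )
    return conn


def archive_document(conn, image_path, ocr_text, keywords):
    """Store one scanned image path together with its extracted metadata."""
    cur = conn.execute(
        "INSERT INTO documents (image_path, ocr_text, keywords) VALUES (?, ?, ?)",
        (image_path, ocr_text, " ".join(sorted(keywords))),
    )
    conn.commit()
    return cur.lastrowid


def search_documents(conn, term):
    """Return image paths whose keywords or OCR text contain the term.

    LIKE wildcards inside the term are not escaped in this sketch.
    """
    pattern = f"%{term.lower()}%"
    rows = conn.execute(
        "SELECT image_path FROM documents "
        "WHERE lower(keywords) LIKE ? OR lower(ocr_text) LIKE ? "
        "ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


# Usage with an in-memory database (hypothetical sample data)
demo = open_archive()
archive_document(demo, "scannedImages/invoice.png", "Invoice from Acme Corp", {"invoice", "acme"})
print(search_documents(demo, "acme"))  # ['scannedImages/invoice.png']
```

Searching both the keyword column and the raw OCR text keeps content-based lookup working even when keyword extraction misses a term.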

MetaScan ensures a seamless transition from a physical document to a digitised, easily retrievable version stored safely within an integrated metadata archival system.

Tech Stack 🏗️:

Python: Core scripting language
SANE Library: Scanner interfacing
Tesseract: OCR processing
NLP Techniques: Lemmatisation, stop word removal, named entity recognition
SQLite: Data storage and retrieval
Flask: Backend server setup
React: Frontend user interface
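
The OCR and NLP entries above could fit together roughly as follows. This is a sketch under assumptions, not MetaScan's own code: it assumes the pytesseract wrapper and Pillow are used to call Tesseract, and that spaCy's `en_core_web_sm` model provides the lemmas, stop word flags, and named entities. The function names `extract_metadata` and `image_to_metadata` are hypothetical.

```python
def extract_metadata(doc):
    """Build a metadata dict from a spaCy Doc (or any object shaped like one).

    Keywords are the lower-cased lemmas of alphabetic, non-stop-word tokens;
    entities are (text, label) pairs from named entity recognition.
    """
    keywords = sorted(
        {
            token.lemma_.lower()
            for token in doc
            if token.is_alpha and not token.is_stop
        }
    )
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"keywords": keywords, "entities": entities}


def image_to_metadata(image_path):
    """OCR an image with Tesseract, then run the NLP step on the text.

    The imports are local because they need third-party packages
    (pytesseract, Pillow, spaCy) that are assumptions of this sketch.
    """
    import pytesseract
    import spacy
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(image_path))
    nlp = spacy.load("en_core_web_sm")
    return extract_metadata(nlp(text))
```

Keeping `extract_metadata` separate from the OCR call means the NLP step can be exercised on any text without a scanner or Tesseract being present.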

Getting Started : Docker Setup🚀

Docker Setup
Follow the instructions here to install Docker on your machine.
If you already have Docker set up, you can skip this step.
Creating Folders
Create a folder named scannedImages and a file named image_database.db on your system. Note the full path to each, as both are needed for the docker run command below.

Clone the repository from GitHub

git clone https://github.com/Kappuccino111/MetaScan.git

Navigate to the project directory
```
cd MetaScan
```

Install MetaScan

sudo docker build -t metascan .
sudo docker run -p 5000:5000 -p 3000:3000 -v /path/to/scannedImages:/app/scannedImages -v /path/to/image_database.db:/app/image_database.db -it metascan
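
The folder and file creation step, and the two -v mounts in the docker run command above, can also be scripted. The following is a minimal Python sketch that assumes both items live under one base directory; the helper names `prepare_mounts` and `docker_run_command` are hypothetical and not part of MetaScan. The image tag is a parameter and defaults to a lowercase name, since Docker rejects repository names that contain uppercase letters.

```python
import shlex
from pathlib import Path


def prepare_mounts(base_dir="."):
    """Create scannedImages/ and image_database.db under base_dir.

    Returns the absolute paths of both, ready to be used as bind mounts.
    """
    base = Path(base_dir).resolve()
    images_dir = base / "scannedImages"
    db_file = base / "image_database.db"
    images_dir.mkdir(parents=True, exist_ok=True)
    db_file.touch(exist_ok=True)
    return images_dir, db_file


def docker_run_command(images_dir, db_file, image="metascan"):
    """Build the docker run command line with the absolute host paths filled in."""
    # shlex.quote protects host paths that contain spaces or shell characters.
    images_mount = shlex.quote(f"{images_dir}:/app/scannedImages")
    db_mount = shlex.quote(f"{db_file}:/app/image_database.db")
    return (
        "sudo docker run -p 5000:5000 -p 3000:3000 "
        f"-v {images_mount} -v {db_mount} -it {image}"
    )
```

Running `print(docker_run_command(*prepare_mounts()))` from the directory where the data should live prints a command that can be pasted into a shell.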

Getting Started : Normal Setup🚀

Clone the repository from GitHub

git clone https://github.com/Kappuccino111/MetaScan.git

Navigate to the project directory
```
cd MetaScan
```
Create a virtual environment
```
python3 -m venv your_env_name
```
Activate the virtual environment
On macOS and Linux:
```
source your_env_name/bin/activate
```

Installing Tesseract Binaries

On Linux

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

On macOS

brew install tesseract

Installing SANE binaries

On Linux

sudo apt-get install sane sane-utils xsane

On macOS

brew install sane-backends

Install the required Python packages

pip install -r requirements.txt
python -m spacy download en_core_web_sm
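
After the installation steps above, a quick check that the prerequisites are visible from the active environment can save debugging time later. The sketch below is an assumption-laden example: the binary names `tesseract` and `scanimage` (the SANE command-line tool) and the module names `pytesseract`, `sane`, `spacy`, and `flask` are guesses at what requirements.txt and the setup steps provide, and `missing_prerequisites` is a hypothetical helper.

```python
import importlib.util
import shutil

# Assumed names; adjust to match requirements.txt and your platform.
REQUIRED_BINARIES = ("tesseract", "scanimage")
REQUIRED_MODULES = ("pytesseract", "sane", "spacy", "flask")


def missing_prerequisites(binaries=REQUIRED_BINARIES, modules=REQUIRED_MODULES):
    """Return a list describing every binary or Python module that cannot be found."""
    missing = [f"binary: {name}" for name in binaries if shutil.which(name) is None]
    missing += [
        f"module: {name}"
        for name in modules
        if importlib.util.find_spec(name) is None
    ]
    return missing
```

An empty list from `missing_prerequisites()` means every listed binary is on the PATH and every listed module is importable in the current virtual environment.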

Set up the front-end server

a) Node.js and npm Installation
- On macOS
```
brew install node
```
- On Ubuntu/Linux
```
sudo apt-get update
sudo apt-get install nodejs npm
```
b) Set up the React server
```
cd front-end
npm install
```
Run the Application

The application can be run using
```
./run.py
```
OR
```
python3 run.py
```

Demo

Once you execute run.py, you will see the following command-line interface:

Choose Option 1 to scan, either in test mode or with a real scanner connected.
Once you have a scanned image or a test image, select Option 2. The web page is then accessible at http://localhost:3000.

Searching in the metadata

Demo.webm

Future Packaging 📦

The MetaScan prototype is written in Python, but it is designed to accommodate future expansion and interoperability with other programming languages such as C or C++. Each component is designed to be individually extractable, so specific functions or components can be improved without affecting the rest of the system.

Future work includes a wrapper to ease integration with C or C++ code, and better metadata extraction for more precise searches. The wrapper will let MetaScan use scanning libraries developed in these languages, such as those for PAPPL or other open-source scanning software, extending its capabilities and improving its performance.

Current Work 🚧

I am currently working on a Sandboxed-Scanner Application Framework for PAPPL. The PR for the ongoing work can be found here.
