Classify Alignment Sheets

At CER, we receive applications from companies containing thousands of pages of documents. We wanted to develop a machine learning algorithm to distinguish the pages that are alignment sheets (maps) from the pages that are not.

Sample Maps:

Sample Non-Maps:

Approach

We tackled this problem by building and training several machine learning classifiers on features extracted from each page of a PDF file with the Python PyMuPDF library. The extracted features include the area of images on a page, the number of images on a page, and the word count of a page. A few more features were generated by checking whether the page contains certain words: "North" or "N", "Figure", "Map", "Alignment Sheet" or "Sheet", "Legend", "scale", and "kilometers" or "km".
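The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the feature names, the helper functions, and the decision to treat only a capital "N" as a north marker are all assumptions. Image blocks are read from PyMuPDF's `page.get_text("dict")` output, where blocks with `type == 1` are images.

```python
import re

# Keyword groups taken from the description above.
# The feature names themselves are hypothetical.
KEYWORD_GROUPS = {
    "has_north": ("north",),
    "has_figure": ("figure",),
    "has_map": ("map",),
    "has_sheet": ("sheet",),  # also covers "Alignment Sheet"
    "has_legend": ("legend",),
    "has_scale": ("scale",),
    "has_km": ("kilometers", "km"),
}


def text_features(text):
    """Build word-count and keyword-presence features from page text."""
    words = re.findall(r"[A-Za-z]+", text)
    lowered = {w.lower() for w in words}
    # Word count uses whitespace-separated tokens (an assumption).
    features = {"word_count": len(text.split())}
    for name, keywords in KEYWORD_GROUPS.items():
        features[name] = int(any(k in lowered for k in keywords))
    # Assumption: a standalone capital "N" (north arrow label) counts as "North".
    if "N" in words:
        features["has_north"] = 1
    return features


def page_features(page):
    """Extract features from one PyMuPDF page object."""
    layout = page.get_text("dict")
    # In the "dict" output, image blocks have type 1 and a bbox (x0, y0, x1, y1).
    boxes = [b["bbox"] for b in layout["blocks"] if b["type"] == 1]
    features = text_features(page.get_text("text"))
    features["image_count"] = len(boxes)
    features["image_area"] = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return features


def pdf_features(path):
    """Return one feature dict per page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    with fitz.open(path) as doc:
        return [{"page": i, **page_features(page)} for i, page in enumerate(doc)]
```

The rows returned by `pdf_features` could then be written to a CSV file, one row per page, which is the kind of output the first notebook is described as producing.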

After feature extraction, several models were built and trained: XGBoost Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, Random Forest Regressor, and XGBoost Regressor. After training, each model's accuracy and performance were evaluated on the validation dataset and on unseen data, i.e. the test dataset. After the evaluation phase, the best-performing model was saved in the models repo for future use.
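To illustrate the evaluation and model-selection step, here is a small sketch that scores each candidate model's validation predictions and picks the best one. The metric choice (accuracy), the function names, and the labels and predictions in the example are all made up for illustration; they are not results from this project.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = map)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def best_model(y_true, predictions, metric="accuracy"):
    """Return the name of the best-scoring model and all scores."""
    scores = {name: evaluate(y_true, pred)[metric] for name, pred in predictions.items()}
    return max(scores, key=scores.get), scores


# Illustrative validation labels and predictions (not real project data).
y_val = [1, 0, 1, 1, 0, 0]
val_predictions = {
    "decision_tree": [1, 0, 0, 1, 0, 1],
    "random_forest": [1, 0, 1, 1, 0, 1],
}
winner, scores = best_model(y_val, val_predictions)
print(winner, scores)
```

In practice the selected model would also be checked on the held-out test set before being saved, since a model chosen on validation data can look better there than it does on unseen pages.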

Note: The output of the regressor models was converted into a binary output using a sigmoid function; hence, these regression models are referred to as classification models here.
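The conversion mentioned in the note can be sketched as below. The README does not say which cut-off was applied after the sigmoid, so `threshold` is a required, hypothetical parameter here rather than a value taken from the project.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, mapping any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_binary(scores, threshold):
    """Turn raw regressor outputs into 0/1 labels.

    `threshold` is an assumption: it would need to be chosen on validation data.
    """
    return [1 if sigmoid(s) >= threshold else 0 for s in scores]


# Example with made-up regressor scores.
labels = to_binary([-2.0, 0.0, 3.0], threshold=0.5)
```

Because the sigmoid is monotonic, thresholding its output is equivalent to thresholding the raw score at a different value, so the cut-off matters more than the squashing itself.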

Model training is not discussed in depth here. Instead, we describe below the structure of this repo and how to run the Jupyter notebook files.

Description of the folder structure

0. Download PDFs and extract features of Alignment Sheets.ipynb: This file contains the functions to download the PDF documents and to extract the features from each page of a PDF file. The output of this Jupyter notebook is a CSV file containing all the extracted features.
1. Save Alignment Sheets.ipynb: This file takes the feature CSV as input and classifies whether each PDF page is an alignment sheet, using the best-performing classifier that we saved in the models repo. The later section of this Jupyter notebook contains the functions to extract and assign titles for the alignment sheets.

How to use the files in this repo?

Clone or download the GitHub files into a local directory.
Create a virtual environment and activate it.
Install the required Python packages from the requirements.txt file.
Open Jupyter Notebook, run the files in the following order, and observe the results:
- 0. Download PDFs and extract features of Alignment Sheets.ipynb
- 1. Save Alignment Sheets.ipynb

nipun-goyal/BERT-Text-Classification
