At CER, we receive applications from companies that contain thousands of pages of documents. We wanted to develop a machine learning algorithm to distinguish pages that are alignment sheets (or maps) from pages that are not.
We tackled this problem by building and training several machine learning classifiers on features extracted from each page of a PDF file with the Python PyMuPDF library. The extracted features include the area of images on a page, the number of images on a page, and the word count of a page. A few more features were generated by checking whether the page contains certain words, such as "North" or "N", "Figure", "Map", "Alignment Sheet" or "Sheet", "Legend", "scale", and "kilometers" or "km".
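The keyword-based features described above can be sketched in plain Python. This is a minimal illustration, not the repo's actual code: the function name `keyword_features`, the feature names, and the case-insensitive whole-word matching are all assumptions. The page text is taken as a plain string; with PyMuPDF it would typically come from `page.get_text()`.

```python
import re

# Hypothetical feature names mapped to the words checked on each page.
KEYWORDS = {
    "has_north": ["north", "n"],
    "has_figure": ["figure"],
    "has_map": ["map"],
    "has_sheet": ["alignment sheet", "sheet"],
    "has_legend": ["legend"],
    "has_scale": ["scale"],
    "has_km": ["kilometers", "km"],
}


def keyword_features(page_text):
    """Return the word count and 0/1 keyword flags for one page of text."""
    text = page_text.lower()
    features = {"word_count": len(text.split())}
    for name, words in KEYWORDS.items():
        # \b keeps "n" from matching inside other words such as "and".
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        features[name] = int(re.search(pattern, text) is not None)
    return features


sample = "Figure 3: Alignment Sheet 12. Legend. Scale 1:5000, 2 km. N"
print(keyword_features(sample))
```

Each page yields one dictionary, so the dictionaries for all pages can be written out as rows of the feature CSV.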
After feature extraction, several models were built and trained: XGBoost Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, Random Forest Regressor, and XGBoost Regressor. After training, each model's accuracy and performance were evaluated on the validation dataset and on unseen data (the test dataset). The best-performing model was then saved in the models repo for future use.
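The step of picking the best model can be illustrated without any ML library. The sketch below assumes each trained model has already produced 0/1 predictions on the validation set and that accuracy is the selection metric. The model names and labels are made-up illustration data, not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of pages whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def best_model(y_true, predictions):
    """Return (name, accuracy) of the model with the highest accuracy."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Made-up validation labels (1 = alignment sheet) and predictions.
y_val = [1, 0, 0, 1, 0, 1]
val_predictions = {
    "model_a": [1, 0, 1, 1, 0, 0],
    "model_b": [1, 0, 0, 1, 0, 0],
}
print(best_model(y_val, val_predictions))
```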
Note: The output of the regressor models was converted into binary output using a sigmoid function; hence, these regression models are referred to as classification models here.
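A minimal sketch of the conversion described in the note, assuming the raw regressor score is passed through a sigmoid and then cut at a threshold. The threshold of 0.5 and the function names are assumptions; the source does not state which cut-off was used.

```python
import math


def sigmoid(x):
    """Map a real-valued score to the interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_binary(scores, threshold=0.5):
    """Convert raw regressor outputs to 0/1 labels (threshold is assumed)."""
    return [1 if sigmoid(s) >= threshold else 0 for s in scores]


print(to_binary([-2.0, 0.3, 1.7]))
```

With a threshold of 0.5, this is equivalent to checking whether the raw score is non-negative, so the choice of threshold determines how the regressor output maps to the two classes.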
Model training is not discussed in depth here. Instead, we describe below the structure of this repo and how to run the Jupyter notebook files.
0. Download PDFs and extract features of Alignment Sheets.ipynb:
This file contains the functions to download the PDF documents and to extract the features from each page of a PDF file. The output of this Jupyter notebook is a CSV file containing all the extracted features.
1. Save Alignment Sheets.ipynb:
This file takes the feature CSV as input and classifies whether each PDF page is an alignment sheet, using the best-performing classifier saved in the models repo. The latter section of this Jupyter notebook contains the functions to extract and assign titles for the alignment sheets.
- Clone or download the GitHub files into a local directory
- Create a virtual environment and activate it
- Install the required Python packages from the requirements.txt file
- Open Jupyter Notebook, run the files in the following order, and observe the results:
0. Download PDFs and extract features of Alignment Sheets.ipynb
1. Save Alignment Sheets.ipynb