This project includes two different model versions for processing and analyzing data from PDF documents and JSON files. This README provides an overview of the features, technologies, and steps for both models:
- model_version_1: Focuses on identifying and classifying tables in PDF documents and extracting GRI standards and page references.
- model_version_2: Involves creating and evaluating a Named Entity Recognition (NER) model using report texts from JSON files.
This model focuses on identifying and classifying tables in PDF documents and extracting specific information. The technologies used and the steps involved are as follows:
Technologies Used:
- Python
- Tabula-py: A Python library used for extracting tables from PDF files.
- Pandas: A powerful Python library for data processing and analysis.
Steps:
- PDF tables from all pages are read using `tabula.read_pdf` and stored in a `tables` list.
- These tables are processed in a loop and printed, as sketched below.
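A minimal sketch of this step, assuming the input document is named `report.pdf` (the actual file names and paths in the project may differ):

```python
import tabula

# Read every table from every page of the PDF; returns a list of DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)

# Process the extracted tables in a loop and print each one.
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table)
```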
Technologies Used:
- Python
- Scikit-learn: A popular Python library for machine learning. A Naive Bayes classifier is used here.
- Pandas: Used to process data and create a training dataset for the model.
Steps:
- Extracted tables from PDFs are compared with pre-labeled tables used as training data.
- Tables' texts are converted into feature vectors using `CountVectorizer`.
- A Naive Bayes model is trained on these features and then classifies whether the tables are GRI tables (see the sketch below).
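The following sketch illustrates the classification step; the training texts and labels shown are placeholders, not the project's actual labeled data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder pre-labeled training data: table texts and GRI / non-GRI labels.
train_texts = ["GRI 102-1 Name of the organization", "Quarterly revenue by region"]
train_labels = [1, 0]  # 1 = GRI table, 0 = not a GRI table

# Convert the tables' texts into bag-of-words feature vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train the Naive Bayes classifier on these features.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify a newly extracted table.
new_table = ["GRI 305-1 Direct GHG emissions, page 42"]
print(clf.predict(vectorizer.transform(new_table)))  # e.g., [1] -> GRI table
```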
Technologies Used:
- Python
- SpaCy: A powerful Python library for natural language processing. A Named Entity Recognition (NER) model is trained and used here.
Steps:
- SpaCy's NER model is used to extract specific GRI standards and page references from the text in GRI tables.
- The model is applied to the extracted rows from GRI tables to label entities (e.g., `gri_standard`, `page_reference`), as illustrated below.
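A brief sketch of applying the model to a table row; the model path and the example row text are assumptions:

```python
import spacy

# Load the trained NER model (trained in model_version_2, saved as custom_ner_model).
nlp = spacy.load("custom_ner_model")

# A hypothetical row of text taken from a GRI table.
row_text = "GRI 305-1 Direct greenhouse gas emissions, page 42"

# Label the entities found in the row.
doc = nlp(row_text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., "GRI 305-1" gri_standard, "42" page_reference
```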
Technologies Used:
- Python
- Scikit-learn: Used for evaluating model performance with classification reports, confusion matrices, and accuracy scores.
- Matplotlib and Seaborn: Used for visualizing the confusion matrix.
Steps:
- The performance of the Naive Bayes model is evaluated on test data, reporting accuracy and a confusion matrix.
- The NER model's accuracy is evaluated on sample texts, and the results are presented as scores.
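A sketch of the evaluation step with placeholder labels (`y_test` and `y_pred` stand in for the real test labels and model predictions):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder true labels and model predictions for the test tables.
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Visualize the confusion matrix as a heatmap.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```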
This model involves creating and evaluating a Named Entity Recognition (NER) model using NLP techniques on report texts from JSON files. The project steps and technologies used are summarized below:
- Objective: Extract `Report_ID` and `PDF` content from JSON files in the `flatten_json` folder and save each report as a text file named `Report_ID.txt`.
- Technologies: Python, JSON, Regex, File I/O
- Description: The JSON file is read, and the PDF content of each report is cleaned and stripped of Unicode characters. The cleaned content is then saved to the corresponding `Report_ID.txt` file.
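A minimal sketch of this step; the JSON field names (`Report_ID`, `PDF`) follow the objective above, but the cleaning pattern and directory layout are assumptions:

```python
import json
import re
from pathlib import Path

for json_path in Path("flatten_json").glob("*.json"):
    with open(json_path, encoding="utf-8") as f:
        report = json.load(f)

    report_id = report["Report_ID"]
    content = report["PDF"]

    # Strip non-ASCII/Unicode artifacts and collapse whitespace.
    cleaned = re.sub(r"[^\x00-\x7F]+", " ", content)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # Save each report as Report_ID.txt.
    with open(f"{report_id}.txt", "w", encoding="utf-8") as out:
        out.write(cleaned)
```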
- Objective: Create training and evaluation datasets for the NER model using labeled data in JSON format.
- Technologies: Python, SpaCy, JSON
- Description: Data labeling is performed using the NER Annotation Tool, which allows for labeling entities (e.g., GRI standards, page references) in the text. The labeled JSON files are used to extract and clean text and entity information, from which a dataset is created. This dataset is divided into `TRAIN_DATA` and `EVAL_DATA` for model training and evaluation.
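A sketch of building the datasets from the annotation tool's JSON export; the export schema shown (an `annotations` list of `[text, {"entities": ...}]` pairs) and the 80/20 split are assumptions:

```python
import json
import random

with open("labeled_data.json", encoding="utf-8") as f:
    labeled = json.load(f)

data = []
for text, annotations in labeled["annotations"]:
    # Each entry pairs a text with its entity spans: (start, end, label).
    entities = [tuple(ent) for ent in annotations["entities"]]
    data.append((text.strip(), {"entities": entities}))

# Shuffle and split into training and evaluation sets (80/20 split assumed).
random.shuffle(data)
split = int(len(data) * 0.8)
TRAIN_DATA = data[:split]
EVAL_DATA = data[split:]
```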
- Objective: Create and train a custom NER model using SpaCy.
- Technologies: SpaCy, Python
- Description: A model is created using SpaCy's NER pipeline. The model is trained for 100 iterations with the training data and saved as `custom_ner_model`.
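A sketch of the training loop, assuming SpaCy v3's imperative training API and the `TRAIN_DATA` format from the previous step:

```python
import random
import spacy
from spacy.training import Example

# Start from a blank English pipeline and add an NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register the entity labels present in the training data.
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()

# Train for 100 iterations, shuffling the data on each pass.
for iteration in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.3, sgd=optimizer, losses=losses)
    print(f"Iteration {iteration}: {losses}")

# Save the trained pipeline as custom_ner_model.
nlp.to_disk("custom_ner_model")
```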
- Objective: Test the trained NER model and evaluate its performance.
- Technologies: SpaCy, Python, Matplotlib, Seaborn, Pandas
- Description: The model is tested on evaluation data, and results are assessed using metrics like Precision, Recall, and F1-Score. Results are visualized with graphs.
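A sketch of scoring the model on `EVAL_DATA` with SpaCy's built-in evaluator (assuming the v3 API):

```python
import spacy
from spacy.training import Example

nlp = spacy.load("custom_ner_model")

# Build Example objects from the held-out evaluation data.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in EVAL_DATA
]

# nlp.evaluate returns entity-level precision, recall, and F1.
scores = nlp.evaluate(examples)
print("Precision:", scores["ents_p"])
print("Recall:", scores["ents_r"])
print("F1-Score:", scores["ents_f"])
```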
- Objective: Process all TXT files with the trained NER model to extract key information such as `page_references`, `direct_answers`, and `gri_standards`.
- Technologies: Python, SpaCy
- Description: Each report file is processed by the model to extract relevant entities. The extracted data is stored in a dictionary named `reports_dict`:

```python
reports_dict[report_id] = {
    "page_references": process_page_references(page_references),
    "direct_answers": process_direct_answers(direct_answers),
    "gri_standards": gri_standards,
}
```
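A sketch of the processing loop that fills `reports_dict`; the entity label names and the TXT directory are assumptions, and `process_page_references` / `process_direct_answers` are the project's post-processing helpers, stubbed here:

```python
import spacy
from pathlib import Path

nlp = spacy.load("custom_ner_model")
reports_dict = {}

def process_page_references(refs):
    # Hypothetical stub; the project's actual post-processing is not shown here.
    return refs

def process_direct_answers(answers):
    # Hypothetical stub.
    return answers

for txt_path in Path("reports_txt").glob("*.txt"):  # directory name is an assumption
    report_id = txt_path.stem  # files are named Report_ID.txt
    doc = nlp(txt_path.read_text(encoding="utf-8"))

    # Group the extracted entity texts by label (label names assumed).
    page_references = [e.text for e in doc.ents if e.label_ == "page_reference"]
    direct_answers = [e.text for e in doc.ents if e.label_ == "direct_answer"]
    gri_standards = [e.text for e in doc.ents if e.label_ == "gri_standard"]

    reports_dict[report_id] = {
        "page_references": process_page_references(page_references),
        "direct_answers": process_direct_answers(direct_answers),
        "gri_standards": gri_standards,
    }
```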
- Objective: Match the extracted results with the original PDF files.
- Technologies: Python, File I/O
- Description: The `report_id` values in `reports_dict` are matched with the original PDF files for further processing.
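A minimal sketch of the matching step, assuming the original PDFs live in a `reports` folder and are named `<report_id>.pdf`:

```python
from pathlib import Path

pdf_dir = Path("reports")  # folder name is an assumption

# Map each report_id in reports_dict to its original PDF file.
matched_pdfs = {
    report_id: pdf_dir / f"{report_id}.pdf"
    for report_id in reports_dict
    if (pdf_dir / f"{report_id}.pdf").exists()
}
```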
- Objective: Generate the final outputs based on the processed data.
- Technologies: Python, PDF Processing
- Description: Two types of outputs are generated:
  - If `direct_answers` are found, they are printed. If `page_references` exist, the corresponding text from the PDF pages is retrieved.
  - If `direct_answers` are found, they are printed. If `page_references` exist, the corresponding PDF pages are saved as a separate PDF file in the `modified_reports` folder.
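A sketch of generating both output types; the README names only "PDF Processing", so the `pypdf` library, 1-based page numbers, and the `matched_pdfs` mapping from the previous step are assumptions:

```python
from pathlib import Path
from pypdf import PdfReader, PdfWriter

out_dir = Path("modified_reports")
out_dir.mkdir(exist_ok=True)

for report_id, data in reports_dict.items():
    # Print any direct answers found by the model.
    for answer in data["direct_answers"]:
        print(report_id, "->", answer)

    if data["page_references"]:
        reader = PdfReader(matched_pdfs[report_id])
        writer = PdfWriter()
        for page_number in data["page_references"]:
            page = reader.pages[int(page_number) - 1]  # 1-based pages assumed
            print(page.extract_text())  # output 1: retrieve the page text
            writer.add_page(page)       # output 2: collect the page for saving
        # Save the referenced pages as a separate PDF in modified_reports.
        with open(out_dir / f"{report_id}.pdf", "wb") as f:
            writer.write(f)
```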
- Python: The primary programming language used for all data processing, machine learning, and natural language processing steps.
- Tabula-py, Scikit-learn, SpaCy: Key Python libraries used for PDF table extraction, machine learning classification, and natural language processing.
- JSON: The data format used to store reports and annotations.
- Matplotlib and Seaborn: Libraries used for visualizing model performance and evaluation results.
- Pandas: A data analysis library used for organizing and inspecting data.
- Google Colab: Used for running, testing, and training models.
- NER Annotation Tool: An annotation tool used for labeling entities in the text files to create training and evaluation datasets for the NER model (GitHub repository: ner-annotator).
Both models aim to automate specific data processing and analysis steps, making these processes more efficient. The relevant libraries and tools listed above have been used to build both models and to evaluate their performance.