This project includes two different model versions for processing and analyzing data from PDF documents and JSON files. This README provides an overview of the features, technologies, and steps for both models:
- model_version_1: Focuses on identifying and classifying tables in PDF documents and extracting GRI standards and page references.
- model_version_2: Involves creating and evaluating a Named Entity Recognition (NER) model using report texts from JSON files.
This model focuses on identifying and classifying tables in PDF documents and extracting specific information. The technologies used and the steps involved are as follows:
Technologies Used:
- Python
- Tabula-py: A Python library used for extracting tables from PDF files.
- Pandas: A powerful Python library for data processing and analysis.
Steps:
- PDF tables from all pages are read using `tabula.read_pdf` and stored in a `tables` list.
- These tables are processed in a loop and printed, as sketched below.
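A minimal sketch of this step, assuming the input document is named `report.pdf` (the actual file names and paths in the project may differ):

```python
import tabula

# Read every table from every page of the PDF; returns a list of DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)

# Process the extracted tables in a loop and print each one.
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table)
```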
Technologies Used:
- Python
- Scikit-learn: A popular Python library for machine learning. A Naive Bayes classifier is used here.
- Pandas: Used to process data and create a training dataset for the model.
Steps:
- Extracted tables from PDFs are compared with pre-labeled tables used as training data.
- Tables' texts are converted into feature vectors using `CountVectorizer`.
- A Naive Bayes model is trained on these features and then classifies whether the tables are GRI tables (see the sketch below).
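The following sketch illustrates the classification step; the training texts and labels shown are placeholders, not the project's actual labeled data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder pre-labeled training data: table texts and GRI / non-GRI labels.
train_texts = ["GRI 102-1 Name of the organization", "Quarterly revenue by region"]
train_labels = [1, 0]  # 1 = GRI table, 0 = not a GRI table

# Convert the tables' texts into bag-of-words feature vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train the Naive Bayes classifier on these features.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify a newly extracted table.
new_table = ["GRI 305-1 Direct GHG emissions, page 42"]
print(clf.predict(vectorizer.transform(new_table)))  # e.g., [1] -> GRI table
```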
Technologies Used:
- Python
- SpaCy: A powerful Python library for natural language processing. A Named Entity Recognition (NER) model is trained and used here.
Steps:
- SpaCy's NER model is used to extract specific GRI standards and page references from the text in GRI tables.
- The model is applied to the extracted rows from GRI tables to label entities (e.g., `gri_standard`, `page_reference`), as illustrated below.
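A brief sketch of applying the model to a table row; the model path and the example row text are assumptions:

```python
import spacy

# Load the trained NER model (trained in model_version_2, saved as custom_ner_model).
nlp = spacy.load("custom_ner_model")

# A hypothetical row of text taken from a GRI table.
row_text = "GRI 305-1 Direct greenhouse gas emissions, page 42"

# Label the entities found in the row.
doc = nlp(row_text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., "GRI 305-1" gri_standard, "42" page_reference
```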
Technologies Used:
- Python
- Scikit-learn: Used for evaluating model performance with classification reports, confusion matrices, and accuracy scores.
- Matplotlib and Seaborn: Used for visualizing the confusion matrix.
Steps:
- The performance of the Naive Bayes model is evaluated on test data, reporting accuracy and a confusion matrix.
- The NER model's accuracy is evaluated on sample texts, and the results are presented as scores.
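A sketch of the evaluation step with placeholder labels (`y_test` and `y_pred` stand in for the real test labels and model predictions):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder true labels and model predictions for the test tables.
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Visualize the confusion matrix as a heatmap.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```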
This model involves creating and evaluating a Named Entity Recognition (NER) model using NLP techniques on report texts from JSON files. The project steps and technologies used are summarized below:
- Objective: Extract `Report_ID` and `PDF` content from JSON files in the `flatten_json` folder and save each report as a text file named `Report_ID.txt`.
- Technologies: Python, JSON, Regex, File I/O
- Description: The JSON file is read, and the PDF content of each report is cleaned and stripped of Unicode characters. The cleaned content is then saved to the corresponding `Report_ID.txt` file.
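A minimal sketch of this step; the JSON field names (`Report_ID`, `PDF`) follow the objective above, but the cleaning pattern and directory layout are assumptions:

```python
import json
import re
from pathlib import Path

for json_path in Path("flatten_json").glob("*.json"):
    with open(json_path, encoding="utf-8") as f:
        report = json.load(f)

    report_id = report["Report_ID"]
    content = report["PDF"]

    # Strip non-ASCII/Unicode artifacts and collapse whitespace.
    cleaned = re.sub(r"[^\x00-\x7F]+", " ", content)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # Save each report as Report_ID.txt.
    with open(f"{report_id}.txt", "w", encoding="utf-8") as out:
        out.write(cleaned)
```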
- Objective: Create training and evaluation datasets for the NER model using labeled data in JSON format.
- Technologies: Python, SpaCy, JSON
- Description: Data labeling is performed using the NER Annotation Tool, which allows for labeling entities (e.g., GRI standards, page references) in the text. The labeled JSON files are used to extract and clean text and entity information, from which a dataset is created. This dataset is divided into `TRAIN_DATA` and `EVAL_DATA` for model training and evaluation.
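A sketch of building the datasets from the annotation tool's JSON export; the export schema shown (an `annotations` list of `[text, {"entities": ...}]` pairs) and the 80/20 split are assumptions:

```python
import json
import random

with open("labeled_data.json", encoding="utf-8") as f:
    labeled = json.load(f)

data = []
for text, annotations in labeled["annotations"]:
    # Each entry pairs a text with its entity spans: (start, end, label).
    entities = [tuple(ent) for ent in annotations["entities"]]
    data.append((text.strip(), {"entities": entities}))

# Shuffle and split into training and evaluation sets (80/20 split assumed).
random.shuffle(data)
split = int(len(data) * 0.8)
TRAIN_DATA = data[:split]
EVAL_DATA = data[split:]
```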
- Objective: Create and train a custom NER model using SpaCy.
- Technologies: SpaCy, Python
- Description: A model is created using SpaCy's NER pipeline. The model is trained for 100 iterations with the training data and saved as `custom_ner_model`.
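A sketch of the training loop, assuming SpaCy v3's imperative training API and the `TRAIN_DATA` format from the previous step:

```python
import random
import spacy
from spacy.training import Example

# Start from a blank English pipeline and add an NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register the entity labels present in the training data.
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()

# Train for 100 iterations, shuffling the data on each pass.
for iteration in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.3, sgd=optimizer, losses=losses)
    print(f"Iteration {iteration}: {losses}")

# Save the trained pipeline as custom_ner_model.
nlp.to_disk("custom_ner_model")
```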
- Objective: Test the trained NER model and evaluate its performance.
- Technologies: SpaCy, Python, Matplotlib, Seaborn, Pandas
- Description: The model is tested on evaluation data, and results are assessed using metrics like Precision, Recall, and F1-Score. Results are visualized with graphs.
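A sketch of scoring the model on `EVAL_DATA` with SpaCy's built-in evaluator (assuming the v3 API):

```python
import spacy
from spacy.training import Example

nlp = spacy.load("custom_ner_model")

# Build Example objects from the held-out evaluation data.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in EVAL_DATA
]

# nlp.evaluate returns entity-level precision, recall, and F1.
scores = nlp.evaluate(examples)
print("Precision:", scores["ents_p"])
print("Recall:", scores["ents_r"])
print("F1-Score:", scores["ents_f"])
```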
- Objective: Process all TXT files with the trained NER model to extract key information such as `page_references`, `direct_answers`, and `gri_standards`.
- Technologies: Python, SpaCy
- Description: Each report file is processed by the model to extract relevant entities. The extracted data is stored in a dictionary named `reports_dict`:

```python
reports_dict[report_id] = {
    "page_references": process_page_references(page_references),
    "direct_answers": process_direct_answers(direct_answers),
    "gri_standards": gri_standards,
}
```
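A sketch of the processing loop that fills `reports_dict`; the entity label names and the TXT directory are assumptions, and `process_page_references` / `process_direct_answers` are the project's post-processing helpers, stubbed here:

```python
import spacy
from pathlib import Path

nlp = spacy.load("custom_ner_model")
reports_dict = {}

def process_page_references(refs):
    # Hypothetical stub; the project's actual post-processing is not shown here.
    return refs

def process_direct_answers(answers):
    # Hypothetical stub.
    return answers

for txt_path in Path("reports_txt").glob("*.txt"):  # directory name is an assumption
    report_id = txt_path.stem  # files are named Report_ID.txt
    doc = nlp(txt_path.read_text(encoding="utf-8"))

    # Group the extracted entity texts by label (label names assumed).
    page_references = [e.text for e in doc.ents if e.label_ == "page_reference"]
    direct_answers = [e.text for e in doc.ents if e.label_ == "direct_answer"]
    gri_standards = [e.text for e in doc.ents if e.label_ == "gri_standard"]

    reports_dict[report_id] = {
        "page_references": process_page_references(page_references),
        "direct_answers": process_direct_answers(direct_answers),
        "gri_standards": gri_standards,
    }
```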
- Objective: Match the extracted results with the original PDF files.
- Technologies: Python, File I/O
- Description: The `report_id` values in `reports_dict` are matched with the original PDF files for further processing.
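A minimal sketch of the matching step, assuming the original PDFs live in a `reports` folder and are named `<report_id>.pdf`:

```python
from pathlib import Path

pdf_dir = Path("reports")  # folder name is an assumption

# Map each report_id in reports_dict to its original PDF file.
matched_pdfs = {
    report_id: pdf_dir / f"{report_id}.pdf"
    for report_id in reports_dict
    if (pdf_dir / f"{report_id}.pdf").exists()
}
```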
- Objective: Generate the final outputs based on the processed data.
- Technologies: Python, PDF Processing
- Description: Two types of outputs are generated:
  - If `direct_answers` are found, they are printed. If `page_references` exist, the corresponding text from the PDF pages is retrieved.
  - If `direct_answers` are found, they are printed. If `page_references` exist, the corresponding PDF pages are saved as a separate PDF file in the `modified_reports` folder.
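A sketch of generating both output types; the README names only "PDF Processing", so the `pypdf` library, 1-based page numbers, and the `matched_pdfs` mapping from the previous step are assumptions:

```python
from pathlib import Path
from pypdf import PdfReader, PdfWriter

out_dir = Path("modified_reports")
out_dir.mkdir(exist_ok=True)

for report_id, data in reports_dict.items():
    # Print any direct answers found by the model.
    for answer in data["direct_answers"]:
        print(report_id, "->", answer)

    if data["page_references"]:
        reader = PdfReader(matched_pdfs[report_id])
        writer = PdfWriter()
        for page_number in data["page_references"]:
            page = reader.pages[int(page_number) - 1]  # 1-based pages assumed
            print(page.extract_text())  # output 1: retrieve the page text
            writer.add_page(page)       # output 2: collect the page for saving
        # Save the referenced pages as a separate PDF in modified_reports.
        with open(out_dir / f"{report_id}.pdf", "wb") as f:
            writer.write(f)
```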
- Python: The primary programming language used for all data processing, machine learning, and natural language processing steps.
- Tabula-py, Scikit-learn, SpaCy: Key Python libraries used for PDF table extraction, machine learning classification, and natural language processing.
- JSON: The data format used to store reports and annotations.
- Matplotlib and Seaborn: Libraries used for visualizing model performance and evaluation results.
- Pandas: A data analysis library used for organizing and inspecting data.
- Google Colab: Used for running, testing, and training models.
- NER Annotation Tool: An annotation tool used for labeling entities in the text files to create training and evaluation datasets for the NER model (GitHub repository: ner-annotator).
Both models aim to automate specific data processing and analysis steps, making these processes more efficient. The relevant libraries and tools listed above have been used to build both models and to evaluate their performance.