This repository provides a deep-learning-based solution for classifying disease articles from Wikipedia. It includes Python scripts to pre-process HTML documents, train a new model (tweaking the model hyperparameters), or run inference with the existing model.
It also includes model weights trained on 13,693 HTML documents (3,693 positive + 10,000 negative). The inputs to the model are:
- Title: the title of the Wikipedia page, extracted from the `<title></title>` tags
- Table of Contents: the list of section headers from the table-of-contents section of the page
- Introduction: the first paragraph of the introduction section of the page
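For illustration, these three inputs could be pulled from a raw HTML document with a parser such as BeautifulSoup. The sketch below is a hedged example only: the repository's own pre_process.py is the authoritative extraction logic, and the selectors used here (e.g. the `toctext` class for TOC entries) are assumptions about typical rendered Wikipedia HTML.

```python
# Hedged sketch of extracting the three model inputs from a Wikipedia HTML page.
# The selectors are assumptions; pre_process.py in this repo is the real logic.
from bs4 import BeautifulSoup

def extract_inputs(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Title: text inside the <title></title> tags
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Table of contents: rendered Wikipedia pages mark TOC entries with the
    # "toctext" class (an assumption about the HTML dumps used here)
    toc = [span.get_text(strip=True) for span in soup.select("span.toctext")]
    # Introduction: first paragraph of body text
    first_paragraph = soup.find("p")
    intro = first_paragraph.get_text(strip=True) if first_paragraph else ""
    return {"title": title, "table_of_contents": toc, "introduction": intro}
```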
Performance Metrics:

| Type | Accuracy |
| --- | --- |
| Train | 99.5% |
| Validation | 97.66% |
| Test | 98.46% |
To run the classifier as a service, clone this repository and spin up the Docker container using the following commands. Once the container is running, a Flask app is launched that receives requests and returns classification results.
In repo root:

```sh
sh ./tasks/build_api_docker.sh
```

In repo root:

```sh
sh ./tasks/run_api.sh
```
To classify articles, add HTML files corresponding to Wikipedia articles to the test-cases directory (api/test_cases). The following command reads the files from this directory and posts a request to the Flask app running inside the Docker container, which returns a positive/negative flag for each article.
In repo root:

```sh
sh ./tasks/test_api.sh
```
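A request can also be posted to the running app directly. The sketch below is an assumption-heavy example: the port, endpoint path, request format, and file name are all hypothetical, so check tasks/test_api.sh and the Flask app source for the actual values.

```python
# Hedged sketch of posting one HTML test case to the running Flask app.
# Port 5000, the /predict path, the JSON body shape, and the file name are
# assumptions; tasks/test_api.sh shows the real request format.
import requests

with open("api/test_cases/example_article.html", encoding="utf-8") as f:
    html = f.read()

response = requests.post("http://localhost:5000/predict", json={"html": html})
print(response.json())  # expected: a positive/negative flag for the article
```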
To stop the container, in repo root:

```sh
sh ./tasks/stop_api.sh
```
To train a new model:
- Add JSON files for the train, validation, and test datasets to the api/app/cleaned_datasets directory. The structure of the JSON files is described in the wiki_dataset.py file in the api/app folder (an illustrative sketch follows this list). Alternatively, create a new directory 'training' (containing sub-directories 'positive' and 'negative' with the respective HTML documents) in the api/app/dataset directory and run pre_process.py.
- Update the params.yaml file in the api/app/model directory to tweak the model architecture (a quick sanity check for this file is sketched after this list). This step can be skipped to keep the existing architecture.
- Run the following command in a terminal, substituting appropriate values for batch_size, n_epochs, and model_file_name:

  ```sh
  python train.py './model/params.yaml' batch_size n_epochs model_file_name
  ```
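For the first step, a record in one of the cleaned JSON files might look like the following. This is a sketch only: the field names and the train.json file name are assumptions, and wiki_dataset.py in api/app is the authoritative description of the expected structure.

```python
# Hypothetical example of writing one cleaned-dataset record to JSON.
# The field names and file name are assumptions; see api/app/wiki_dataset.py
# for the actual schema the training code expects.
import json

record = {
    "title": "Malaria",
    "table_of_contents": ["Signs and symptoms", "Cause", "Diagnosis", "Treatment"],
    "introduction": "Malaria is a mosquito-borne infectious disease ...",
    "label": 1,  # 1 = disease article (positive), 0 = negative
}

with open("api/app/cleaned_datasets/train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2)
```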
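For the second step, params.yaml can be sanity-checked before launching a run. A minimal sketch, assuming PyYAML is installed; the keys inside the file are defined by the repository and are not reproduced here.

```python
# Quick check that params.yaml still parses after editing (requires PyYAML).
import yaml

with open("api/app/model/params.yaml") as f:
    params = yaml.safe_load(f)

print(params)  # review the architecture hyperparameters before training
```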