A universal category prediction tool based on Wikipedia, created for the DTU course 02807 - Computational Tools for Data Science.
The project was created by Sabine Fie Hansen (s134583) and Rasmus Arpe Fogh Jensen (s134843).
Since the code is not run by simply calling a single main method, this guide describes how to set up and run each part of the project.
To see which external Python libraries are needed to run the code, we refer to `requirements.txt` or `environment.yml`.

To set up a virtual environment using Anaconda, run

```
conda env create -f environment.yml
```

which will create the virtual environment `02807-project`. It can be activated with

```
source activate 02807-project
```

To set up a virtual environment using pip, first create a virtual environment with Python 3.7 and activate it, then run

```
pip install -r requirements.txt
```
We've included a subset of the data files used in the repository (the original files were each roughly 50 MB), so preprocessing is not necessary in order to test the models or run the category prediction app. However, if you wish to create the dataset using the same approach, the process can be divided into three steps:
- Preprocessing all of Wikipedia
- Extracting article IDs for each category
- Matching the preprocessed Wikipedia articles to the extracted article IDs
In order to preprocess all Wikipedia articles, you first need to download a Wikipedia dump from https://dumps.wikimedia.org/enwiki/latest/. In this project we used `enwiki-latest-pages-articles.xml.bz2`.

In `wikipedia_handlers.py` the function `generate_corpus()` needs to be called. You can either pass `in_file` as an argument with the path to the Wikipedia dump, or place the exact Wikipedia dump we used in the `./data/` directory. The function will preprocess all Wikipedia articles into `./data/articles/`.

Warning: Preprocessing all of Wikipedia is very time consuming and will fill up your local drive.
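A minimal sketch of this step (assuming `generate_corpus()` accepts the dump path through the `in_file` parameter described above; verify against `wikipedia_handlers.py`):

```python
# Sketch only: assumes generate_corpus() takes the dump path as in_file.
from wikipedia_handlers import generate_corpus

# Path to the downloaded Wikipedia dump (adjust to your local setup).
dump_path = "./data/enwiki-latest-pages-articles.xml.bz2"

# Preprocess all articles into ./data/articles/ (very time consuming).
generate_corpus(in_file=dump_path)
```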
In `wikipedia_api_handler.py` the function `extract_page_ids_for_category()` can be called. You need to pass the parameter `category` corresponding to the category you want to extract. The function saves two files in `./data/categories/`: one file with article IDs and one file with queued subcategories. It is possible to continue extracting article IDs for a category using `continue_category_page_extraction()`. See the `__main__` method of `wikipedia_api_handler.py` for how to call the functions; a sketch is also given below.

Side note: You need to extract article IDs separately for each category in order to abide by Wikipedia's rule of thumb of at most one request per second.
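As a rough sketch (the `category` parameter is documented above, but the exact signatures, including how `continue_category_page_extraction()` is parameterized, are assumptions; the `__main__` method of `wikipedia_api_handler.py` is the authoritative example):

```python
# Sketch only: verify the exact arguments against the __main__ method
# of wikipedia_api_handler.py.
from wikipedia_api_handler import (
    extract_page_ids_for_category,
    continue_category_page_extraction,
)

# Extract article ids for a single category; this writes one file with
# article ids and one with queued subcategories to ./data/categories/.
extract_page_ids_for_category(category="Machine learning")

# If the extraction was interrupted, it can be resumed from the saved
# queue of subcategories (assumed to take the same category argument).
# continue_category_page_extraction(category="Machine learning")
```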
If you have successfully completed the two steps above, simply call the `create_dataset()` function in `wikipedia_handlers.py`. This will generate the final dataset.
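A minimal sketch of that call (assuming `create_dataset()` can be invoked without arguments and picks up the files under `./data/`; adjust if the function expects paths or category names):

```python
# Sketch only: assumes create_dataset() reads the preprocessed articles
# and the extracted category ids from ./data/ and writes the final dataset.
from wikipedia_handlers import create_dataset

create_dataset()
```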
To run each model separately, simply call `ml_models.py` with either the argument `SVC` or `LSH`, specifying which model to run. This will train the model and evaluate its performance on the test set. If you want to time both models, run the `performance_comparison.py` script.
The `DataHandler.py` file includes the Python class `DataHandler`, which controls all of the preprocessing and train/test splits depending on its arguments.
To run the app, first open the file `./wiki_catpred_app/predidiction/views.py` and set the `super_path` variable to match the root of the project on your local machine. Then simply go to the directory `./wiki_catpred_app/` and execute

```
./manage.py runserver
```

This will launch the app on localhost using port 8000. Go to the URL http://127.0.0.1:8000/prediction/ in order to predict a new document.
Note that if you want to test an unseen document, you first need to extract the text and tokenize the document using the `tokenize_document()` function in `utils.py`. The `__main__` method in `utils.py` shows how to tokenize an unseen document so that it works with the web app and the models.
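As a rough sketch (assuming `tokenize_document()` takes the raw document text as a string and returns the tokenized form expected by the models; the `__main__` method of `utils.py` shows the real usage):

```python
# Sketch only: the exact signature of tokenize_document() may differ,
# see the __main__ method of utils.py for the authoritative example.
from utils import tokenize_document

# Read the raw text of an unseen document (hypothetical file name).
with open("my_unseen_document.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize the document so it can be used with the web app / models.
tokens = tokenize_document(raw_text)
print(tokens)
```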