UCL5 MieMie is a Natural Language Processing (NLP), data mining and web scraping engine for use across UCL in research and teaching.
Our goal is to scrape data, map it and train classifiers in order to build an overview of the teaching and research activity currently taking place at University College London.
The project goals are split into 3 sections:
- Perform keyword searches on Scopus research publications of UCL academics and researchers.
- Map UCL modules to the UN SDGs (United Nations Sustainable Development Goals).
- Map UCL researchers to IHE (Institute of Healthcare Engineering) subject areas and areas of expertise.
The solution is to design, tune and implement NLP (Natural Language Processing) and machine learning algorithms that classify text against a given set of topics, each of which is characterised by an extensive set of keywords. The classification, training and validation results are then exposed through a Django web application, which supports keyword searches, visual interpretation of the data, and NLP and SVM model predictions.
Our Django web application allows for performing keyword searches across UCL modules and research publications.
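As a rough illustration of what a keyword search over module descriptions can look like, here is a minimal sketch. It is not the project's implementation: the function name `keyword_search`, the module codes and the descriptions are all made up for the example, and matching is assumed to be case-insensitive on whole words or phrases.

```python
import re


def keyword_search(documents, keywords):
    """Return {doc_id: sorted matched keywords} for documents that contain any keyword."""
    # Pre-compile one whole-word, case-insensitive pattern per keyword or phrase.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw.lower()) + r"\b")
        for kw in keywords
    }
    results = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        hits = sorted(kw for kw, pat in patterns.items() if pat.search(lowered))
        if hits:
            results[doc_id] = hits
    return results


# Hypothetical module codes and descriptions, for illustration only.
modules = {
    "COMP0001": "Introduction to machine learning and data mining.",
    "GEOG0002": "Climate change, renewable energy and sustainable cities.",
    "HIST0003": "Medieval European history.",
}
matches = keyword_search(modules, ["renewable energy", "machine learning", "climate"])
print(matches)
```

Documents with no matching keyword are left out of the result, so the output can be used directly as a list of search hits.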
For mapping UCL modules to UN SDGs, we first compile an extensive set of keywords for the 17 SDGs, together with a Misc category (a general set of keywords covering all SDGs). We then use LDA (semi-supervised Latent Dirichlet Allocation using collapsed Gibbs sampling) as a first step to learn a much larger and more representative SDG-keyword distribution and module-SDG distribution. The module-SDG distribution is used to extract the most closely related SDGs for each module, and these are then used as labels for training a more sophisticated, supervised machine learning algorithm in the form of a Support Vector Machine (SVM).
For mapping UCL research publications to IHE areas of expertise, we use the same methodology as for the SDG mapping: the LDA algorithm first annotates the publications, and an SVM then performs the final-stage classification. The final goal is to extract researchers for each of the IHE & Digital Health specialities and map them against a selection of approaches to research, forming the Bubble Chart.
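The labelling step described above, where the most related SDGs are read off a module-SDG distribution, can be sketched as follows. This is an assumption-laden illustration rather than the project's code: `top_sdg_labels`, the `top_n` and `min_prob` parameters, the module names and the three-topic probabilities are all hypothetical (the real distribution would have one column per SDG plus Misc).

```python
def top_sdg_labels(distribution, top_n=1, min_prob=0.0):
    """Map each module to the SDG numbers (1-based) with the highest probability.

    distribution: {module_id: [p_sdg1, p_sdg2, ...]} as produced by a topic model.
    Only the top_n topics whose probability is at least min_prob are kept.
    """
    labels = {}
    for module, probs in distribution.items():
        # Rank topic indices from most to least probable (ties keep original order).
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        labels[module] = [i + 1 for i in ranked[:top_n] if probs[i] >= min_prob]
    return labels


# Hypothetical three-topic distribution, for illustration only.
distribution = {
    "MOD_A": [0.7, 0.2, 0.1],
    "MOD_B": [0.1, 0.45, 0.45],
}
labels = top_sdg_labels(distribution, top_n=2, min_prob=0.3)
print(labels)
```

The resulting labels could then serve as training targets for a supervised classifier; a probability threshold keeps weakly related topics from becoming labels.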
The following website gives a greater overview of the challenges and design decisions that were made, implementation using the Python programming language and research undertaken.
- Marilyn Aviles - marilyn.aviles@ucl.ac.uk
- Aashvin Relwani - aashvin.relwani.19@ucl.ac.uk
- Albert Mukhametov - albert.mukhametov.19@ucl.ac.uk
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Create a folder for this project, open a Terminal / Command Prompt in that folder and run the following commands:
$ git clone https://github.com/thatguy1104/MieMieDjango-Web-App.git
$ cd NLP-Data-Mining-Engine
Creating a virtual development environment:
$ python3 -m venv venv
Activate the environment:
$ source venv/bin/activate
You should now see (venv) at the start of the terminal prompt (which ends in a $).
$ cd src
Install the required libraries:
$ pip3 install -r requirements.txt
Note: running a command modifies files as well as certain database contents, and may overwrite existing values. Run the commands in the order listed below.
$ python3 global_controller.py LOAD publications
$ python3 global_controller.py LOAD modules
$ python3 global_controller.py NLP run_LDA_SDG
$ python3 global_controller.py NLP run_LDA_IHE
$ python3 global_controller.py NLP module_string_match
$ python3 global_controller.py NLP scopus_string_match_SDG
$ python3 global_controller.py NLP scopus_string_match_IHE
$ python3 global_controller.py NLP predict_scopus_data
$ python3 global_controller.py NLP create_SDG_SVM_dataset
$ python3 global_controller.py NLP create_IHE_SVM_dataset
$ python3 global_controller.py NLP run_SVM_SDG
$ python3 global_controller.py NLP run_SVM_IHE
$ python3 global_controller.py NLP validate_sdg_svm
$ python3 global_controller.py SYNC synchronize_raw_mongodb
$ python3 global_controller.py SYNC synchronize_mongodb
$ python3 global_controller.py SYNC synchronize_bubble
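Because the order of these commands matters, it can help to drive them from a small script. The sketch below is not part of the project: `PIPELINE`, `build_commands` and `run_pipeline` are hypothetical names, and the script assumes it is run from the directory containing global_controller.py. It defaults to a dry run that only prints the commands, since each real step may overwrite files and database contents.

```python
import subprocess
import sys

# The (group, action) pairs in the order listed above.
PIPELINE = [
    ("LOAD", "publications"),
    ("LOAD", "modules"),
    ("NLP", "run_LDA_SDG"),
    ("NLP", "run_LDA_IHE"),
    ("NLP", "module_string_match"),
    ("NLP", "scopus_string_match_SDG"),
    ("NLP", "scopus_string_match_IHE"),
    ("NLP", "predict_scopus_data"),
    ("NLP", "create_SDG_SVM_dataset"),
    ("NLP", "create_IHE_SVM_dataset"),
    ("NLP", "run_SVM_SDG"),
    ("NLP", "run_SVM_IHE"),
    ("NLP", "validate_sdg_svm"),
    ("SYNC", "synchronize_raw_mongodb"),
    ("SYNC", "synchronize_mongodb"),
    ("SYNC", "synchronize_bubble"),
]


def build_commands(steps=PIPELINE, python=sys.executable):
    """Turn (group, action) pairs into argument lists for subprocess."""
    return [[python, "global_controller.py", group, action] for group, action in steps]


def run_pipeline(commands, dry_run=True):
    """Print each command; when dry_run is False, also execute it and stop on failure."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on top of a failed one.
            subprocess.run(cmd, check=True)


commands = build_commands(python="python3")
run_pipeline(commands, dry_run=True)
```

Passing `dry_run=False` would execute the steps for real, stopping at the first one that fails.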
Before scraping, first ensure that the file “cleaned_RPS_export_2015.csv” in the directory src/main/SCOPUS/GIVEN_DATA_FILES is up to date. The file should contain a column titled “DOI”. The scraper reads the given DOIs, compares them with existing records and scrapes only those not already present in the database, so the file must keep its structure to avoid unexpected script errors. Second, set up a Scopus API key. Once the key has been set up, ensure that you are on a UCL network (either on UCL Wi-Fi or connected to a UCL virtual machine); connecting through the UCL VPN also works (instructions). Finally, run the following command to start scraping:
python3 global_controller.py SCRAPE_PUB
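The DOI comparison described above can be pictured with the following sketch. It is an illustration under stated assumptions, not the scraper's actual code: `dois_to_scrape` is a hypothetical helper, the set of existing DOIs stands in for a database lookup, the sample DOIs are made up, and DOIs are compared case-insensitively.

```python
import csv
import io


def dois_to_scrape(csv_file, existing_dois):
    """Return DOIs from the CSV's "DOI" column that are not already stored."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "DOI" not in reader.fieldnames:
        raise ValueError('expected a column titled "DOI"')
    known = {doi.strip().lower() for doi in existing_dois}
    pending, seen = [], set()
    for row in reader:
        doi = (row["DOI"] or "").strip()
        key = doi.lower()
        # Skip blank cells, already-stored DOIs and duplicates within the file.
        if doi and key not in known and key not in seen:
            seen.add(key)
            pending.append(doi)
    return pending


# Made-up sample data standing in for the export file and the database.
sample_csv = io.StringIO(
    "Title,DOI\n"
    "Paper A,10.1000/aaa\n"
    "Paper B,10.1000/bbb\n"
    "Paper C,\n"
    "Paper D,10.1000/BBB\n"
    "Paper E,10.1000/ccc\n"
)
pending = dois_to_scrape(sample_csv, existing_dois={"10.1000/aaa"})
print(pending)
```

Failing early when the "DOI" column is missing mirrors the requirement that the file keep its structure.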
Register for the UCL API here, then initialise the departmental data, which must be done before scraping modules. Run the command below:
python3 global_controller.py MOD initialise
Reset the current module data records:
python3 global_controller.py MOD resetDB
Next, to reflect the current student population data, keep the file “studentsPerModule.csv” up to date in the directory src/main/MODULE_CATALOGUE/STUDENTS_PER_MOD. To synchronise that data with the database, run the following command:
python3 global_controller.py MOD updateStudentCount
Finally, module scraping can be performed. Ensure the MySQL credentials in the config.ini file (under the SQL_SERVER section) are valid and up to date, then run the following command to freshly scrape the UCL module catalogue data:
python3 global_controller.py MOD scrape
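A quick way to check the SQL_SERVER section before scraping is to load it with Python's standard `configparser`. The sketch below is hedged: only the section name SQL_SERVER comes from this README, while the key names (`server`, `database`, `username`, `password`), the sample values and the helper `load_sql_settings` are assumptions made for illustration. Check the project's own config.ini for the real keys.

```python
import configparser

# Hypothetical contents; the real config.ini may use different key names.
SAMPLE_CONFIG = """
[SQL_SERVER]
server = example-host
database = example_db
username = example_user
password = change-me
"""

REQUIRED_KEYS = ("server", "database", "username", "password")


def load_sql_settings(parser, section="SQL_SERVER"):
    """Return the required settings, failing loudly if any are missing or blank."""
    if not parser.has_section(section):
        raise KeyError(f"missing [{section}] section")
    missing = [
        key for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]
    if missing:
        raise ValueError(f"missing values in [{section}]: {', '.join(missing)}")
    return {key: parser.get(section, key) for key in REQUIRED_KEYS}


parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)  # use parser.read("config.ini") for a real file
settings = load_sql_settings(parser)
print(sorted(settings))
```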
Ensure you are in the global project directory and run the following command to execute all unit tests:
python3 -m unittest discover src/test -p 'test_*.py'
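For reference, a test file picked up by the discovery pattern above only needs a name matching `test_*.py` and a `unittest.TestCase` subclass. The sketch below is illustrative: `normalise_keyword` is a hypothetical helper, not a function from this project, and the suite is run programmatically here only so the example is self-contained.

```python
import unittest


def normalise_keyword(keyword):
    """Hypothetical helper: lowercase a keyword and collapse repeated whitespace."""
    return " ".join(keyword.lower().split())


class TestNormaliseKeyword(unittest.TestCase):
    def test_lowercases_and_collapses_whitespace(self):
        self.assertEqual(normalise_keyword("  Climate   Change "), "climate change")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise_keyword(""), "")


# Load and run the test case directly instead of relying on file discovery.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormaliseKeyword)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved as, say, src/test/test_keywords.py (a made-up file name), the same class would be found by the discover command without the last two lines.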
- UCL API - UCL Department details
- Scopus API - Source for UCL publication data
- UCL Module Catalogue - Source for UCL module data
- Latent Dirichlet Allocation (LDA) - unsupervised topic modelling algorithm used to classify text in a document to a topic
- GuidedLDA - implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling
- Support Vector Machine (SVM) - SVM linear classifier with SGD (Stochastic Gradient Descent) training
For the versions available, see the tags on this repository.
This project is licensed under the MIT License - see the LICENSE.md file for details