FredBaos/Ntds_project_team02

Wikipedia Recommender System

Welcome to our project repository for the Network Tour of Data Science course at EPFL!

We implemented a query-based search engine for Wikipedia articles related to various Machine Learning topics.

In other words, given a query, our system retrieves and suggests articles with semantically similar content. We also provide a graph visualisation tool to interact with the query engine.
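As an illustration of the retrieval idea, here is a minimal sketch that ranks articles by cosine similarity between a query vector and article vectors. It is not the project's actual code: the function names, the toy 3-dimensional vectors and the article titles are assumptions made up for this example, whereas the real system relies on pretrained word embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_articles(query_vec, article_vecs, top_k=3):
    """Return the top_k (title, score) pairs, best match first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in article_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
articles = {
    "Neural network": [0.9, 0.1, 0.0],
    "Decision tree": [0.1, 0.9, 0.0],
    "Deep learning": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

for title, score in rank_articles(query, articles, top_k=2):
    print(f"{title}: {score:.3f}")
```

With these toy vectors, the two articles whose vectors point in nearly the same direction as the query are ranked first.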

More details about this ML system can be found in the project [report](Team 02 - Project report.pdf).

How to reproduce results:

In the commands below, 'wd' stands for the directory containing the run.sh script (in the project folder).

  • Run the command export PYTHONPATH=wd

NOTE: if you want to use a virtual environment, run the following:

  • python3 -m venv ntds
  • echo 'export PYTHONPATH=wd' >> ntds/bin/activate

From wd, run the following:

  • Run the command sudo apt install build-essential python-dev libxml2 libxml2-dev zlib1g-dev bison flex
  • pip3 install -r requirements.txt
  • pip3 install pymagnitude==0.1.120 --no-binary :all:
  • Specify INITIAL_FILENAME in config.py. This is the name of the file produced on Seealsology, which must be placed in the data folder. The seeds used to scrape the graph are given in the seeds_seealsology.txt file (we used a distance of 2).
  • Download the wiki-news-300d-1M-subword.magnitude file and put it into the data folder.
  • Execute the run.sh script (takes a few minutes to run).
  • Run exploration.ipynb and/or exploitation.ipynb for the respective analysis.
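Before executing run.sh, it can help to confirm that the data folder contains the files the steps above ask for. The following is a small sketch of such a check; the helper name `missing_data_files` is hypothetical and not part of the repository, and `"<INITIAL_FILENAME>"` is a placeholder for whatever name you set in config.py.

```python
from pathlib import Path


def missing_data_files(data_dir, required_names):
    """Return the names in required_names that are not files in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required_names if not (data_dir / name).is_file()]


# Placeholder for the Seealsology export name set in config.py.
required = ["wiki-news-300d-1M-subword.magnitude", "<INITIAL_FILENAME>"]
missing = missing_data_files("data", required)
if missing:
    print("Missing from the data folder:", ", ".join(missing))
```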

Interactive Visualisation:

After having done the previous part, run the command: python3 visualization/app.py 8888

NOTE: if you want to put the app online, you have to perform all the installation steps above in "sudo" mode and run the following command instead: sudo PYTHONPATH=wd python3 visualization/app.py 80. Alternatively, allow the current user to use port 80.

You can choose any of the three methods to perform a query.

To query multiple concepts, separate them with commas, e.g. machine learning,text processing. If you use a server, port 80 must be open for external access.
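A comma-separated query such as the one above can be split into individual concepts along these lines. This is a sketch only: `parse_query` is a hypothetical helper, and the app's own parsing may differ (for example in how it treats whitespace or letter case).

```python
def parse_query(raw):
    """Split a comma-separated query into a list of non-empty concepts."""
    concepts = [part.strip() for part in raw.split(",")]
    return [concept for concept in concepts if concept]


print(parse_query("machine learning,text processing"))
```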

  • After you click on a node, the 'Chosen node' link redirects you to the corresponding web page.
  • Only the page titles of the nodes that best fit the query, and of their neighbours, are shown.
  • Red edges mean that the pages are linked through the 'See also' section on the Wikipedia website.
  • The color of the nodes represents the cosine similarity score.
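One simple way to turn a similarity score into a node colour is linear interpolation between two RGB endpoints. The sketch below assumes scores clamped to [0, 1] and a white-to-blue scale; both the function `score_to_hex` and the colour scale are assumptions made for illustration, not the mapping actually used by the app.

```python
def score_to_hex(score, low=(255, 255, 255), high=(0, 0, 255)):
    """Map a score in [0, 1] to a hex colour between low and high (RGB)."""
    t = max(0.0, min(1.0, score))  # clamp out-of-range scores
    channels = [round(lo + (hi - lo) * t) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


for s in (0.0, 0.25, 1.0):
    print(s, score_to_hex(s))
```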

This web app has only been tested on Chrome for Linux (78.0.3904.70).

Files breakdown:

run.sh : shell script executing the acquisition, exploitation and visualisation tasks.

Acquisition:

  • acquisition_helpers.py : various helpers for the acquisition.py script.
  • acquisition.py : loads the dataset and augments it with URLs and extracted keywords. Creates the df_node dataframe, which contains node information, and the df_edge dataframe, which contains edge relations.

Exploration:

Exploitation:

Visualization:

  • app.py: runs the visualisation app on a dedicated server
  • create_visu.py: creates and saves the graph visualisation
  • utils.py: various helpers

Helpers:

Data:

  • Data: contains every file loaded and generated by the different modules.

Authors

  • EL Amrani Ayyoub
  • Micheli Vincent
  • Myotte Frédéric
  • Sinnathamby Karthigan

License

Wikipedia Recommender System - Network Tour of Data Science EE-558 - EPFL - Fall 2019 - Team 2

Copyright (c) 2019 EPFL

This program is licensed under the terms of the GPL.
