This repository contains the solution for Homework 3 of the course Algorithmic Methods of Data Mining.
The main goal of this homework is to explore and analyze data on Michelin restaurants in Italy. The project covers several tasks, including data cleaning, feature extraction, text processing and visualization. To collect the data, we scraped the Michelin restaurants site, extracted the HTML of each restaurant page, and saved it in a folder called HTML Michelin Restaurants.
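The scraping step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the notebook: the function names `fetch_html` and `save_html`, the `restaurant_<index>.html` file naming and the User-Agent header are all assumptions made for the example. Only the folder name comes from this README.

```python
from pathlib import Path
from urllib.request import Request, urlopen

# Folder name taken from this README; everything else below is hypothetical.
HTML_DIR = Path("HTML Michelin Restaurants")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def save_html(html: str, index: int, folder: Path = HTML_DIR) -> Path:
    """Write one restaurant page to <folder>/restaurant_<index>.html.

    The file naming scheme is an assumption for this sketch.
    """
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"restaurant_{index}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

In such a setup, each restaurant URL would be passed to `fetch_html` and the result to `save_html`, so that later parsing can run on the local copies without contacting the site again.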
Project/
: A folder containing a notebook with the progress and comments for each task, and a .py file with the code for an advanced search engine. Specifically, it includes:
- Michelin-restaurant-in-Italy-web-scraping.ipynb: A Jupyter Notebook containing Python code, explanations and outputs for each question of the homework, plus pseudocode for an algorithm that moves a robot through a warehouse to collect packages.
- Warning: To view all of the notebook's output, download the file and open it in a supported development environment;
- search_engine_filters.py: This file is used in Michelin-restaurant-in-Italy-web for the Bonus part. It is an interactive restaurant search engine built with Python and ipywidgets that offers advanced filtering capabilities. It features an intuitive interface with dynamic drop-down menus and checkboxes.
files/
: A Dropbox folder containing all output files generated by the homework tasks. In detail:
- all_restaurants_data.csv: Dataset containing all information about Michelin restaurants in Italy, collected from the Michelin website. This file is the output of Question 1.3;
- vocabulary.csv: A CSV file that maps each word in the description column of all_restaurants_data.csv to a unique integer (term_id). This file is the result of Question 2.1.1;
- inverted_index.pkl: A pickle file containing a dictionary that maps each term_id to the list of document IDs in which that term appears. This file is the output of Question 2.1.1;
- coordinates.csv: A CSV file containing the coordinates of all unique cities in the dataset. This file is the output of Question 4;
- top_k_restaurants_map.html: An HTML file that displays the top-k Michelin restaurants, ranked by a custom scoring system, on a map. This is the result of Question 4. To view the map, download the HTML file or open the notebook in a development environment such as Visual Studio, Jupyter or PyCharm.
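
To make the relationship between vocabulary.csv and inverted_index.pkl concrete, here is a minimal sketch of how a word-to-term_id mapping and a term_id-to-documents index can be built from a list of descriptions. It is an illustration under stated assumptions, not the notebook's implementation: `tokenize` and `build_index` are hypothetical names, the tokenizer is a plain lowercase/letters-only split (the actual preprocessing may differ), and document IDs are assumed to be row positions.

```python
import pickle
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(descriptions: list[str]) -> tuple[dict[str, int], dict[int, list[int]]]:
    """Return (vocabulary, inverted_index) for a list of descriptions.

    vocabulary maps word -> term_id; inverted_index maps term_id -> sorted doc IDs.
    Document IDs are assumed to be the positions in the input list.
    """
    vocabulary: dict[str, int] = {}
    postings: dict[int, list[int]] = defaultdict(list)
    for doc_id, text in enumerate(descriptions):
        # set() ensures each document is listed at most once per term.
        for word in sorted(set(tokenize(text))):
            term_id = vocabulary.setdefault(word, len(vocabulary))
            postings[term_id].append(doc_id)
    return vocabulary, dict(postings)


descriptions = [
    "Modern Italian cuisine",
    "Classic cuisine with a modern twist",
]
vocabulary, inverted_index = build_index(descriptions)

# Documents containing "modern": look up the term_id, then its posting list.
modern_docs = inverted_index[vocabulary["modern"]]

# The index is a plain dict, so it survives a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(inverted_index))
```

Storing integer term IDs instead of raw words keeps the index compact, and a conjunctive query can then be answered by intersecting the posting lists of its terms.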