Not Tripadvisor

This project was developed for the DSIA-4203C course. The goal was to crawl and scrape data from a website of our choice. We decided to retrieve users and reviews from Tripadvisor.

Table of contents

  1. Getting Started
    1.1 Prerequisites
    1.2 Installing and Running
    1.3 The app doesn't run?
  2. User Guide
    2.1 Home Page
    2.2 Search Page
    2.3 Graph Page
  3. Reference Guide
    3.1 Crawling Ex Nihilo
    3.2 Why Elasticsearch?

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

Thanks to Docker, it is really the only thing you need to set up, along with Docker Compose. You can install Docker here. If your machine does not meet the system requirements, check out Docker Toolbox. Docker Compose is available over here.

Installing and Running

Clone this repository:

$ git clone https://github.com/borisghidaglia/data-engineering.git

Go into the repository:

$ cd data-engineering/

Then download the data dump we prepared, so that it can be restored in the mongo container you will start. Make sure you have curl and unzip installed.

$ mkdir data && \
mkdir data/dump && \
cd data/dump && \
curl -o tripadvisor_dump.zip https://perso.esiee.fr/~prolonga/data/tripadvisor_dump.zip && \
unzip tripadvisor_dump.zip && \
rm tripadvisor_dump.zip && \
cd ../..
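If curl or unzip is not available on your machine, the same steps can be sketched with the Python standard library. This is only an illustration: the helper names `extract_and_remove` and `fetch_dump` are hypothetical and not part of the project. The URL and the data/dump directory are the ones used in the commands above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the curl command above.
DUMP_URL = "https://perso.esiee.fr/~prolonga/data/tripadvisor_dump.zip"


def extract_and_remove(archive, target_dir):
    """Unzip `archive` into `target_dir`, delete the archive, return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        names = zf.namelist()
    Path(archive).unlink()
    return names


def fetch_dump(url=DUMP_URL, target_dir="data/dump"):
    """Download the dump archive into `target_dir`, then extract and remove it."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / "tripadvisor_dump.zip"
    urllib.request.urlretrieve(url, str(archive))
    return extract_and_remove(archive, target)
```

Calling `fetch_dump()` from the root of the repository should leave the extracted dump in data/dump, as the shell commands do.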

If you installed Docker Desktop, make sure it is running. If you use Docker Toolbox and the docker-machine is not started yet, run:

$ docker-machine start default

Finally:

$ docker-compose up -d

Once the app, mongo, mongo_seed and elastic services are built and up, wait a little while for the data to be restored. You will then be able to use the app in your browser.
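Rather than guessing how long "a little while" is, you can poll the app until it answers. The sketch below is a minimal example under stated assumptions: `is_up` and `wait_for_app` are hypothetical helpers, and the URL you pass in depends on how the port is published in your docker-compose file (the Flask logs only tell us that it listens on port 5000 inside the container).

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if an HTTP server answers at `url`, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied (4xx/5xx), so it is at least running.
        return True
    except (urllib.error.URLError, OSError):
        return False


def wait_for_app(url, attempts=30, delay=2.0, probe=is_up, sleep=time.sleep):
    """Probe `url` up to `attempts` times; return the successful attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

For instance, `wait_for_app("http://localhost:5000")` would be a reasonable call if port 5000 is published on localhost, which is an assumption and not something this README states.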

The app doesn't run?

Do not panic. Docker lets you see what is going on through the service logs:

$ docker-compose logs -f --tail=50 <service_name>
  • -f "follows" the logs as they are being written by the service.
  • --tail=50 restricts the output to the last 50 lines of the log.

The available values for <service_name> are: app, elastic, mongo, mongo_seed.

Tip: if you are checking the logs for the app and you do not see the following lines, or something similar, mongo and elastic are probably not done restoring the data yet.

* Serving Flask app "main" (lazy loading)
* Environment: production
  WARNING: Do not use the development server in a production environment.
  Use a production WSGI server instead.
* Debug mode: on
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 313-098-103
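The check described in the tip can also be automated. The sketch below looks for the "Running on" line in the output of docker-compose logs. The helper names `flask_is_ready` and `read_app_logs` are hypothetical, and the marker string is simply taken from the sample output above, so adjust it if your logs look different.

```python
import subprocess

# Marker taken from the sample Flask output above.
READY_MARKER = "* Running on"


def flask_is_ready(log_text):
    """Return True if any log line contains Flask's "Running on" banner."""
    return any(READY_MARKER in line for line in log_text.splitlines())


def read_app_logs(service="app", tail=50):
    """Return the last `tail` log lines of `service` as plain text."""
    result = subprocess.run(
        ["docker-compose", "logs", "--no-color", f"--tail={tail}", service],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the containers started, `flask_is_ready(read_app_logs())` returns False as long as the banner has not shown up in the last 50 lines.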

Built With

  • Flask - Web framework, with API views for easier communication with client-side JavaScript
  • Scrapy - Used to extract data from Tripadvisor's website
  • MongoDB - A NoSQL database used to store unstructured data
  • Elasticsearch - Fast and relevant search engine, used in the Search page and the Graph page
  • Docker - Used to facilitate portability, ease of deployment and scalability of our project... Theoretically.

User Guide

Home Page

The home page is very basic: you can only scroll and lazy-load more review cards.

[GIF: Home Page]

Search Page

If you want to find specific reviews or users, you have found the right page!

[GIF: Search Page]

Graph Page

For the secret plot lover in you, here are a few plots that give a global view of the grade distribution of reviews.

[GIF: Graph Page]

Reference Guide

Crawling Ex Nihilo

If you'd like to populate the database yourself, here are the commands you'll need to run:

Start the container:

$ docker-compose up -d

Then, run each crawler individually:

$ docker-compose exec app scrapy crawl tripadvisor_attraction
# Crawls the name of every g_value (Tripadvisor attraction id) listed in the JSON file
$ docker-compose exec app scrapy crawl tripadvisor_attraction_review
# Crawls the places listed in the JSON file, using the attraction names scraped before
$ docker-compose exec app scrapy crawl tripadvisor_user
# Crawls all users who left a review on the places scraped above (this will take a while)
$ docker-compose exec app scrapy crawl tripadvisor_review
# Crawls the first ten reviews of all users present in the database (this will take even longer!)

You may want to stop crawling users at some point and carry on with reviews.
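Since the order of the four crawlers matters, it can be convenient to chain them. Here is a minimal sketch of that idea, assuming docker-compose is on your PATH and the containers are already up. The names `crawl_command` and `run_crawlers` are hypothetical helpers, not part of the project.

```python
import subprocess

# Spider names in the order given above: each one relies on data scraped by the previous ones.
SPIDERS = [
    "tripadvisor_attraction",
    "tripadvisor_attraction_review",
    "tripadvisor_user",
    "tripadvisor_review",
]


def crawl_command(spider, service="app"):
    """Build the docker-compose command that runs one spider in the given service."""
    return ["docker-compose", "exec", service, "scrapy", "crawl", spider]


def run_crawlers(spiders=SPIDERS, runner=subprocess.run):
    """Run each spider in order; check=True stops the chain at the first failure."""
    for spider in spiders:
        runner(crawl_command(spider), check=True)
```

Calling `run_crawlers()` runs the whole chain, and `run_crawlers(SPIDERS[3:])` only the last step, for example after interrupting the user crawl by hand. When no terminal is attached (cron, CI), docker-compose exec may need its -T option, which disables pseudo-TTY allocation.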

If you want to know what g_values and d_values are, check out the comments in tripadvisor_crawler/items.py. To change their starting values, edit tripadvisor_crawler/spiders/g_values.json and tripadvisor_crawler/spiders/d_values_by_attraction.json.

Why Elasticsearch?

We chose Elasticsearch for a more forgiving search. With it, we can adapt to grammar (for instance, words with or without an 's' at the end) and, even better, to spelling errors, as demonstrated in the two page GIFs where magnifique and magnifike give the same output. We used Elasticsearch in the search page (obviously) and for the first type of graph because of its 'intelligence' while searching. Elsewhere, we simply used MongoDB.
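To give an idea of why magnifique and magnifike can match, here is a small sketch. It is not the project's actual query code, which this README does not show. It assumes a fuzzy match query, one common way to get typo tolerance in Elasticsearch. The field name "text" is a placeholder, and `levenshtein` and `fuzzy_match_query` are hypothetical helpers. The edit-distance function only illustrates how far apart the two spellings are (Elasticsearch itself also counts transpositions).

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Build the body of a match query that tolerates small spelling differences."""
    return {"query": {"match": {field: {"query": text, "fuzziness": fuzziness}}}}


print(levenshtein("magnifique", "magnifike"))  # 2
print(fuzzy_match_query("text", "magnifike"))
```

With "AUTO" fuzziness, Elasticsearch allows up to two edits for terms longer than five characters, which is enough to cover the distance of 2 between the two spellings. Handling plurals such as a trailing 's' is usually the job of the analyzer (stemming) and not of fuzziness.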
