This project was developed during the course DSIA-4203C. The goal was to crawl and scrape data from a chosen website. We decided to retrieve users and reviews from Tripadvisor.
- Getting Started
  1.1 Prerequisites
  1.2 Installing and Running
  1.3 The app doesn't run?
- User Guide
  2.1 Home Page
  2.2 Search Page
  2.3 Graph Page
- Reference Guide
  3.1 Crawling Ex Nihilo
  3.2 Why Elasticsearch?
These instructions will get you a copy of the project up and running on your local machine.
Thanks to Docker's magic, it's really the only thing you need to set up - ok, along with Docker Compose. You can install Docker here. If you don't meet the system requirements, don't forget to check out Docker Toolbox. As for Docker Compose: it's over here.
Clone this repository with :
$ git clone https://github.com/borisghidaglia/data-engineering.git
Go into the repository :
$ cd data-engineering/
Then, you need to download the dump we prepared for you, so that it can be restored in the mongo container you will start.
- Make sure you have curl and unzip installed
$ mkdir -p data/dump && \
cd data/dump && \
curl -o tripadvisor_dump.zip https://perso.esiee.fr/~prolonga/data/tripadvisor_dump.zip && \
unzip tripadvisor_dump.zip && \
rm tripadvisor_dump.zip && \
cd ../..
If you downloaded Docker Desktop, make sure the app is running. You've got Docker Toolbox? If the docker-machine isn't started yet, run:
$ docker-machine start default
Finally :
$ docker-compose up -d
Once app, mongo, mongo_seed and elastic are built and up, and after waiting for a little while, you will be able to use the app in your browser.
Do not panic. Docker will help you understand what is going on by letting you access service logs like so:
$ docker-compose logs -f --tail=50 <service_name>
- -f is used to "follow" the logs as they are being written by the service output.
- --tail=50 restricts the output to the last 50 lines of the logs
As for the <service_name> you can use, they are: app, elastic, mongo, mongo_seed.
Tip: if you are checking the logs for app and you don't see the following lines, or something similar, it is probably because mongo and elastic are not done restoring the data yet.
* Serving Flask app "main" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: on
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 313-098-103
- Flask - Web framework, with API views for easier communication with client-side JavaScript
- Scrapy - Used to extract data from Tripadvisor's website
- MongoDB - A NoSQL database used to store unstructured data
- Elasticsearch - A fast and relevant search engine, used in the Search page and the Graph page
- Docker - Used to facilitate the portability, ease of deployment and scalability of our project... Theoretically.
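To give an idea of how Flask serves JSON to the client-side JavaScript, here is a minimal sketch of an API view. The route name and the in-memory data are assumptions for illustration; the real app reads reviews from MongoDB.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory data standing in for the mongo collection;
# the real app queries MongoDB instead.
REVIEWS = [
    {"user": "alice", "grade": 5, "text": "Magnifique !"},
    {"user": "bob", "grade": 3, "text": "Correct."},
]

@app.route("/api/reviews")
def api_reviews():
    # An API view returns JSON that client-side JavaScript can fetch
    # to render review cards without reloading the page.
    return jsonify(REVIEWS)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The client-side code only has to `fetch("/api/reviews")` and render the resulting JSON.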
The home page is very basic: you can only scroll and lazy-load more review cards.
If you want to find specific reviews or users, you've found the perfect page!
Calling to the secret plot lover in you: here are a few plots to get a global view of the grade distribution of reviews.
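A grade distribution like this can be computed with a MongoDB aggregation. A minimal sketch, assuming the review documents store their rating in a `grade` field (the field name is an assumption, not taken from the actual schema):

```python
def grade_distribution_pipeline():
    """Build a MongoDB aggregation pipeline counting reviews per grade.

    With pymongo you would run db.reviews.aggregate(pipeline)
    and feed the resulting buckets to the plotting code.
    """
    return [
        # Group all reviews by their grade and count each bucket.
        {"$group": {"_id": "$grade", "count": {"$sum": 1}}},
        # Order buckets from the lowest grade up to the highest.
        {"$sort": {"_id": 1}},
    ]
```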
If you'd like to populate the database yourself, here are the commands you'll need to run:
Start the containers:
$ docker-compose up -d
Then, run each crawler individually:
$ docker-compose exec app scrapy crawl tripadvisor_attraction
# Crawls names for every g_value (Tripadvisor attraction id) listed in the json file
$ docker-compose exec app scrapy crawl tripadvisor_attraction_review
# Crawls places listed in the json file, using the attraction names scraped before
$ docker-compose exec app scrapy crawl tripadvisor_user
# Crawls all users who left a review on the places scraped above (this will take a while)
$ docker-compose exec app scrapy crawl tripadvisor_review
# Crawls the first ten reviews of every user present in the database (this will take even longer!)
You may want to stop crawling users at a certain point and carry on with reviews.
If you want to know what g_values and d_values are, check out the comments in tripadvisor_crawler/items.py. If you wish to modify their starting values, change tripadvisor_crawler/spiders/g_values.json and tripadvisor_crawler/spiders/d_values_by_attraction.json.
We chose Elasticsearch for a smarter search. With it we can adapt to grammar (for instance, words with and without an 's' at the end) or, even better, to spelling errors! This is demonstrated in the two page gifs, where magnifique and magnifike give the same output.
We used Elasticsearch in the Search page (obviously) and for the first type of graph, because of its 'intelligence' while searching. Everywhere else we simply used mongo.
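The spelling tolerance above comes from Elasticsearch's fuzzy matching. A minimal sketch of such a query body, assuming the reviews are indexed with a `review` text field (the index layout is an assumption, not taken from the actual mapping):

```python
def fuzzy_review_query(text, field="review"):
    """Build an Elasticsearch match query tolerant to spelling errors.

    With the elasticsearch-py client you would run
    es.search(index="reviews", body=fuzzy_review_query("magnifike")).
    "AUTO" fuzziness allows an edit distance of up to 2 on longer terms,
    which is enough for 'magnifike' to match 'magnifique'.
    """
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": "AUTO"}
            }
        }
    }
```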