Commit

Merge pull request #26 from algo7/fix/comment_date_of_stay_not_always_there

Fix/comment date of stay not always there
algo7 committed May 4, 2023
2 parents 7d3278f + 47bf8fa commit 2be6128
Showing 3 changed files with 47 additions and 13 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -76,6 +76,7 @@ latest: Pulling from algo7/tripadvisor-review-scraper/scrap
## Known Issues
1. The hotel scraper works for English reviews only.
2. The restaurant scraper can only scrape English or French reviews.
3. The hotel scraper uses date of review instead of date of stay as the date because the date of stay is not always available.

# Container Provisioner
Container Provisioner is a tool written in [Go](https://go.dev/) that provides a UI for users to interact with the scraper. It uses the [Docker API](https://docs.docker.com/engine/api/) to provision the containers and run the scraper. The UI is written in raw HTML and JavaScript, while the backend web framework is [Fiber](https://docs.gofiber.io/).
20 changes: 20 additions & 0 deletions libs/utils.js
@@ -46,6 +46,26 @@ const monthStringToNumber = (monthString) => {
return 10;
case 'November':
return 11;
case 'Jan':
return 1;
case 'Feb':
return 2;
case 'Mar':
return 3;
case 'Apr':
return 4;
case 'Jun':
return 6;
case 'Jul':
return 7;
case 'Aug':
return 8;
case 'Sep':
return 9;
case 'Oct':
return 10;
case 'Nov':
return 11;
default:
return 12;
}
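
The hunk above extends the `switch` with abbreviated month names. In the lines shown, any string without its own `case` falls through to `default` and returns 12, so an unrecognized value is indistinguishable from December. As a rough sketch of an alternative (not the repository's code), the mapping could be a lookup on the first three letters. The names `MONTH_PREFIXES` and `monthToNumber` are hypothetical, and returning `null` for unknown input is an assumed design choice that differs from the committed behavior.

```javascript
// Hypothetical alternative to the switch in libs/utils.js.
// Keys on the first three letters, so 'Apr' and 'April' resolve identically.
const MONTH_PREFIXES = [
    'jan', 'feb', 'mar', 'apr', 'may', 'jun',
    'jul', 'aug', 'sep', 'oct', 'nov', 'dec',
];

const monthToNumber = (monthString) => {
    // Normalize: coerce to string, trim, keep three letters, lowercase
    const prefix = String(monthString).trim().slice(0, 3).toLowerCase();
    const index = MONTH_PREFIXES.indexOf(prefix);

    // Assumption: report unknown input as null instead of defaulting to 12
    return index === -1 ? null : index + 1;
};

console.log(monthToNumber('Apr'));      // 4
console.log(monthToNumber('November')); // 11
console.log(monthToNumber('???'));      // null
```

Callers that rely on the current default of 12 would need to handle `null` explicitly if they adopted something like this.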
39 changes: 26 additions & 13 deletions scrapers/hotel.js
@@ -193,24 +193,38 @@ const scrape = async (totalReviewCount, reviewPageUrls, position, hotelName, hot
});

// Extract date of stay
const commentDateOfStay = await page.evaluate(async () => {
const commentDateOfReview = await page.evaluate(async () => {

const commentDateOfStayBlocks = document.getElementsByClassName('teHYY')
// const commentDateOfStayBlocks = document.getElementsByClassName('teHYY')
const commentDateBlocks = document.getElementsByClassName("cRVSd")

const dates = [];
// const datesOfStay = [];
const datesOfReview = [];

for (let index = 0; index < commentDateOfStayBlocks.length; index++) {

// Split the date of stay text block into an array
const splitted = commentDateOfStayBlocks[index].innerText.split(' ')
// for (let index = 0; index < commentDateOfStayBlocks.length; index++) {

dates.push({
month: splitted[3],
year: splitted[4],
// // Split the date of stay text block into an array
// const splitted = commentDateOfStayBlocks[index].innerText.split(' ')

// datesOfStay.push({
// month: splitted[3],
// year: splitted[4],
// });
// }

for (let index = 0; index < commentDateBlocks.length; index++) {

// Split the date of comment text block into an array
const splitted = commentDateBlocks[index].children[0].innerText.split('review').pop().split(' ')

datesOfReview.push({
month: splitted[1],
year: splitted[2],
});
}

return dates;
return datesOfReview;
});

// Extract comments text
@@ -230,12 +244,11 @@ const scrape = async (totalReviewCount, reviewPageUrls, position, hotelName, hot

// Format (for CSV processing) the reviews so each review of each page is in an object
const formatted = commentContent.map((comment, index) => {

return {
title: commentTitle[index],
content: comment,
month: monthStringToNumber(commentDateOfStay[index].month),
year: commentDateOfStay[index].year,
month: monthStringToNumber(commentDateOfReview[index].month),
year: commentDateOfReview[index].year,
rating: commentRatingStringToNumber(commentRating[index]),
};
});
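
The new extraction loop in this file takes the text after the last occurrence of `review`, splits it on spaces, and reads the month and year by position (`splitted[1]` and `splitted[2]`). A minimal sketch of the same idea with a shape check is shown below. The helper name `parseReviewDate` and the sample strings are hypothetical; the exact text TripAdvisor renders in these blocks is an assumption, not something confirmed by the diff.

```javascript
// Hypothetical helper mirroring the split-on-'review' approach in scrapers/hotel.js.
// Assumed input shape: "<user> wrote a review <Mon> <YYYY>".
const parseReviewDate = (text) => {
    // Keep only what follows the last occurrence of 'review'
    const tail = String(text).split('review').pop().trim();

    // Accept only "<letters> <4 digits>"; anything else is reported as null
    const match = /^([A-Za-z]+)\s+(\d{4})$/.exec(tail);
    return match ? { month: match[1], year: match[2] } : null;
};

console.log(parseReviewDate('Jane wrote a review Apr 2023')); // { month: 'Apr', year: '2023' }
console.log(parseReviewDate('no date here'));                 // null
```

Since the callback passed to `page.evaluate` runs in the page context, a helper like this would have to be defined inside that callback, or applied afterwards to raw strings returned from it.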