Added option to scrape X amount of pages
augustobottelli committed Apr 3, 2020
1 parent 89726b1 commit dd96fa3
Showing 2 changed files with 12 additions and 4 deletions.
7 changes: 6 additions & 1 deletion README.md
@@ -22,6 +22,11 @@ $ pip3 install -r requirements.txt
```
$ python3 restaurants_scraper.py --city "Buenos Aires"
```
- If you wish to scrape only X pages instead of the whole catalog, you can add:
```
$ python3 restaurants_scraper.py --city "Buenos Aires" --max_pages X
```

It currently works for these cities:
- Buenos Aires
- Panama City
@@ -39,4 +44,4 @@ More cities can be added by including their city code and name from the Tripadvisor URL
## Disclaimer
As mentioned before, the program is a web scraper and its correctness relies on Tripadvisor's HTML structure. If the page suffers changes, the program will break.

As of today **2019/01/07 the program still works**
As of today **2020/04/03 the program still works**
9 changes: 6 additions & 3 deletions restaurants_scraper.py
@@ -108,7 +108,8 @@ def get_restaurant_info(restaurant_tag):

def _set_cli():
parser = argparse.ArgumentParser()
parser.add_argument("--city", type=str)
parser.add_argument("--city", type=str, required=True, help="Need to specify a city")
parser.add_argument("--max_pages", type=int)
args, unknown = parser.parse_known_args()
if unknown:
logging.warning(f"Unknown parameter {unknown}")
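The hunk above makes `--city` mandatory and adds the optional `--max_pages` flag, using `parse_known_args` so stray flags produce a warning rather than a hard exit. A minimal standalone sketch of that behavior (`set_cli` here is a hypothetical stand-in for the script's `_set_cli`, which reads `sys.argv`):

```python
import argparse
import logging

def set_cli(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--city", type=str, required=True, help="City to scrape")
    parser.add_argument("--max_pages", type=int)  # stays None when omitted
    # parse_known_args collects unrecognized flags instead of exiting
    args, unknown = parser.parse_known_args(argv)
    if unknown:
        logging.warning(f"Unknown parameter {unknown}")
    return args

args = set_cli(["--city", "Buenos Aires", "--max_pages", "3", "--typo"])
print(args.city, args.max_pages)  # Buenos Aires 3 (and a warning for --typo)
```

Because `--max_pages` defaults to `None`, the later `if not args.max_pages:` branch falls back to scraping the whole catalog when the flag is omitted.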
@@ -135,7 +136,10 @@ def _make_csv(restaurants_lists, city, date):
page_offset = 0
full_url = BASE_URL + f"/Restaurants-{city_code}-oa{page_offset}-{city_name}"
first_page = get_html_and_parse(full_url)
last_page_offset = _get_last_page_offset(first_page)
if not args.max_pages:
last_page_offset = _get_last_page_offset(first_page)
else:
last_page_offset = (args.max_pages - 1) * PAGE_OFFSET_INTERVAL
last_page = (last_page_offset / PAGE_OFFSET_INTERVAL) + 1

logging.info(f"Scraping page 1 of {int(last_page)}")
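The pagination math in the hunk above maps a page count to a Tripadvisor URL offset and back. A sketch of that arithmetic, assuming `PAGE_OFFSET_INTERVAL = 30` (a hypothetical value for illustration; the real constant is defined elsewhere in restaurants_scraper.py):

```python
# Assumed step between consecutive listing pages (hypothetical value).
PAGE_OFFSET_INTERVAL = 30

def last_page_offset_for(max_pages):
    # Page N starts at URL offset (N - 1) * PAGE_OFFSET_INTERVAL, so
    # limiting the run to max_pages pages means stopping at this offset.
    return (max_pages - 1) * PAGE_OFFSET_INTERVAL

def page_count(last_page_offset):
    # Inverse mapping, as used by the logging line in the diff.
    return int(last_page_offset / PAGE_OFFSET_INTERVAL) + 1

print(last_page_offset_for(3))  # 60
print(page_count(60))           # 3
```

With `--max_pages 1` the offset is 0, so only the first listing page is fetched.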
@@ -149,5 +153,4 @@ def _make_csv(restaurants_lists, city, date):
restaurants_information = get_restaurants_info(
restaurants_data, page_html, thread_pool
)

_make_csv(restaurants_data, args.city, DATE)

2 comments on commit dd96fa3

@aquinto92

Hi, I am quite new to Python, so sorry in advance if this seems too easy! First, thanks for the code, I think it's great. Unfortunately I get a lot of repeated restaurants in the responses. Any idea why that might be?

Thanks!

@augustobottelli
Owner Author


Hi aquinto! Thank you for your words!
Yes, I was told that this scraper became useless when Tripadvisor switched to a dynamic website structure a few months ago, so there is no way to paginate with a static URL. He pointed it out in this issue: #3. A solution to this problem is to use Selenium, but that means a total refactor of the project, which I don't have the time to do at the moment. Feel free to ask any other questions.
