[BUG] reviews_all doesn't download all reviews of an app with large amount of reviews #209
I'm seeing the same issue even when I set the number of reviews (25,000 in my case). I'm only getting back about 500, and the count changes each time I run it.
Me too, and I found that the number returned is always a multiple of 199. It seems that Google Play randomly blocks retrieval of the next page of reviews.
This is probably a dupe of #208. The error seems to be the Play service intermittently returning an error inside a 200 success response, which then fails to parse as the JSON the library expects. It seems to contain this:
The error seems to happen frequently but not reliably. Scraping in chunks of 200 reviews, basically every request has a decent chance of failing, so a run usually collects 200-1000 reviews before it dies. Currently, the library swallows this exception silently and quits. Handling this error lets the scraping continue as normal. We monkey-patched around it like this and seem to have gotten back to workable scraping:

```python
import json
from typing import Optional

import google_play_scraper
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    # MOD error handling: retry when Play returns an error payload inside a 200
    if "error.PlayDataError" in dom:
        return _fetch_review_items(
            url, app_id, sort, count, filter_score_with, pagination_token
        )
    # ENDMOD
    match = json.loads(Regex.REVIEWS.findall(dom)[0])
    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]


google_play_scraper.reviews._fetch_review_items = _fetch_review_items
```
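One caveat with the patch above: it retries by recursing, so a persistently failing endpoint could recurse without bound. The same idea with an explicit retry cap, written as a generic, library-independent sketch (the helper name and arguments are illustrative, not part of google-play-scraper):

```python
def retry_on_error(fetch, is_error, max_retries=5):
    """Call fetch() until is_error(result) is falsy, giving up after max_retries."""
    for _ in range(max_retries):
        result = fetch()
        if not is_error(result):
            return result
    raise RuntimeError("response still contained an error after retries")
```

In the patch, `fetch` would wrap the `post(...)` call and `is_error` would check for the `"error.PlayDataError"` marker.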
Still not able to get more than a few hundred reviews.
Hey @adilosa @funnan @paulolacombe, can you all please explain how to implement this fix? I am trying to scrape reviews using `reviews_all` in Google Colab, but it won't work for me. It would be great if you could help!
Hey @Shivam-170103, you need to use the code lines that @adilosa provided to replace the corresponding ones in the reviews.py file in your environment. Let me know if that helps, as I am not that familiar with Google Colab.
Thanks @adilosa and @paulolacombe. I don't know why, but even after applying @adilosa's solution, the number of reviews returned here is still very low.
Hello! I tried the monkey patch suggested by @adilosa, scraping a big app like eBay. Instead of getting 8 or 10 reviews, I did end up getting 199, but I am expecting thousands of reviews (that's how it used to be several weeks ago). Any update on getting this fixed? Cheers, and thank you.
Same for me: the number of reviews scraped has plummeted since around 15 Feb, and @adilosa's patch does not change my numbers by much.
This mod did not work for me either, so I tried a different approach that did, in reviews.py:
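The snippet itself did not survive the copy into this thread. Judging from later comments, the change retried the request until the response actually parsed, instead of crashing on the first bad payload. A hypothetical reconstruction of that idea, written against an injected `post_once` function so it runs standalone (the regex and the function name are assumptions, not the library's actual code):

```python
import re

# Assumed shape of the library's reviews regex: Play's batchexecute
# responses start with the ")]}'" anti-JSON-hijacking prefix (illustrative).
REVIEWS_RE = re.compile(r"\)\]\}'\n\n([\s\S]+)")


def fetch_until_parseable(post_once, pattern=REVIEWS_RE, max_tries=10):
    """Call post_once() until the body matches pattern; return the payload."""
    for _ in range(max_tries):
        dom = post_once()
        found = pattern.findall(dom)
        if found:
            return found[0]
    raise RuntimeError("no parseable response after %d tries" % max_tries)
```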
@funnan, thanks for sharing it!
@funnan Thank you! I tried that and seemed to get a few more reviews, but not the full count. I'm not sure I implemented the patch correctly, though. What I did was put the entire
Is this how to apply your patch? If not, could you provide an example of the correct way? Thanks so much.
Both mods don't work for me: the first doesn't change anything, and @funnan's just loops forever and never returns.
I'm having the same issue and am trying to use the workaround posted by @adilosa (thanks!). However, it's giving me a pagination token error.
Can someone please tell me what this should be set to? I've tried None, 0, 100, 200, and 2000 as values for `pagination_token`, but always get the same TypeError. This is how I have the variables defined:
Greatly appreciate any input.
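For reference, the continuation token is not a number you choose: the first call is made without one, and each call returns a token object that you pass unchanged into the next call. A sketch of two-page fetching with google-play-scraper's documented `reviews` API (the app id is a placeholder; the import is deferred so the library is only needed when the function is called):

```python
def first_two_pages(app_id="com.example.app"):
    # Requires: pip install google-play-scraper
    from google_play_scraper import Sort, reviews

    batch1, token = reviews(app_id, lang="en", country="us",
                            sort=Sort.NEWEST, count=200)
    # Pass the returned token object as-is; never an integer like 0 or 200.
    batch2, token = reviews(app_id, lang="en", country="us",
                            sort=Sort.NEWEST, count=200,
                            continuation_token=token)
    return batch1 + batch2
```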
Here's my code (I fix the number of reviews I need and break the loop when that number has been crossed):
and in reviews.py I added the mod from my original comment.
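The shape of such a capped loop, written against an injected `fetch_page(token)` callable standing in for the `reviews(...)` call so the sketch runs without the library (the cap and page handling are illustrative):

```python
def collect_up_to(fetch_page, target):
    """Accumulate pages until `target` reviews are collected or pages run out.

    fetch_page(token) -> (batch, next_token); next_token is None on the last page.
    """
    collected, token = [], None
    while len(collected) < target:
        batch, token = fetch_page(token)
        collected.extend(batch)
        if not batch or token is None:
            break
    return collected[:target]
```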
I have tried your code, and it worked for me running on Colab.
@funnan Thank you, that works!
Thanks bro, worked for me as well.
Unfortunately, it is still not working for me. I suspect that Google has put some limitations on crawling.
The progress bar raises an error after displaying the following:
@myownhoney did you edit the reviews.py file using the fix from @funnan?
It works now :) Cheers!
@AndreasKarasenko @myownhoney can you show me your code please? It still did not work for me.
My code is in the previous comment. Have you tried editing reviews.py?
If we use this code before running our script, is it compulsory to edit reviews.py first, or can we just run this code and that's all? The @funnan patch worked for me on Jupyter.
I dug around a bit in the code, starting from @adilosa's solution.
After applying the required changes, this is the new patch to be done in
With these changes it appears to be working.
Thanks for the fix! I still have a problem getting it to work. Since `filter_device_with` is needed, do we need to pass it in the other functions where `_fetch_review_items` is used as well? I tried your patch without the `filter_device_with` parameter, and somehow it works. But I'm wondering if there is any problem with that?
Thank you so much @adilosa, your code worked well. For those experiencing the same problem, here's what I did:
Add this code above the `match` variable in the `_fetch_review_items` function.
Before:
After:
Then I ran the code from the documentation.
Result:
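The before/after snippets did not survive the copy. Per @adilosa's patch earlier in the thread, the added lines check the raw response for the Play error marker before the `match = ...` line tries to parse it. The check itself, as a standalone helper:

```python
def is_play_data_error(dom: str) -> bool:
    """True when a 200 response carries Play's error payload instead of the
    expected reviews JSON (the failure signature observed in this thread)."""
    return "error.PlayDataError" in dom
```

Placed just above the `match` assignment, a true result means re-fetch instead of parse.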
In my experience it is not necessary to add the parameter to other parts of the code, but I am only using the `reviews_all` method. Also, it seems that different people are getting different error messages (perhaps depending on their location?), so it really depends on what behaviour the program shows on your side.
Hi everybody.

```python
from typing import Optional

import pandas as pd
```
Hey, I was having a similar issue yesterday. You have to make sure you don't run `import reviews` after you apply the fix to `_fetch_review_items`, or it will revert it back to the broken form. Try importing at the beginning, then running the fix, then running the call, and that should work!
Thank you very much for the advice!
Today I tried to collect reviews all day, but to no avail. I tried all the methods from the thread, without success.
Fill in your app_id and try running this:
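The snippet itself was lost in the copy; a minimal stand-in using google-play-scraper's documented `reviews_all` call (the app id is a placeholder to fill in; the import is deferred so the library is only needed when you actually call it):

```python
def fetch_all(app_id="com.example.app"):
    # Requires: pip install google-play-scraper
    from google_play_scraper import Sort, reviews_all

    return reviews_all(app_id, lang="en", country="us", sort=Sort.NEWEST)
```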
When debugging a Node.js sister project, I was able to fix a similar problem by ensuring cookie persistence from the first request. For testing purposes you can grab
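Translated to Python, cookie persistence across requests can be sketched with the standard library alone. This is an untested hypothesis for google-play-scraper, which (as far as the thread suggests) does not share a cookie jar between its requests:

```python
import urllib.request
from http.cookiejar import CookieJar

# One shared jar: cookies set by the first response are replayed on later requests.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def get(url, timeout=10):
    """Fetch a URL through the cookie-persisting opener."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```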
I have the same issue; I can only get 398 comments/reviews using this simple code:
If anyone finds a fix, let us know.
Thanks for the help. I'll try to run the code today.
Hello guys. Sorry for not paying attention to the library. I've read all the discussions, and I've confirmed that the small modifications from @funnan are valid for most cases. Unfortunately, Google Play is conducting various experiments in various countries, including A/B testing of the UI and data structure. Therefore, although most cases can be solved with the method proposed by @funnan, the following function call, for example, generates an infinite loop:

```python
reviews_all(
    "com.poleposition.AOSheroking",
    sort=Sort.MOST_RELEVANT,
    country="kr",
    lang="ko",
)
```

So @funnan and everyone's suggestions are really good, but they can cause infinite-loop-like problems in edge cases, so I need to research it more. By default, this library is not official, and Google Play does not allow crawling through robots.txt. Therefore, I think it might have been better not to support complex features like `reviews_all`. I think it would be good for everyone to write your own
We also observed that the API response from Google sometimes randomly didn't include the token, meaning the loop would end as if it were the last page. We simply retried the request a few times and usually got a continuation token eventually!
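That retry can be isolated into a small helper; `fetch_page` here stands in for a single `reviews(...)` call so the sketch runs without the library (names and retry count are illustrative):

```python
def page_with_token_retry(fetch_page, token=None, retries=3):
    """Fetch one page, re-requesting if the continuation token is missing.

    fetch_page(token) -> (batch, next_token). A spuriously absent next_token
    would end pagination early, so retry the same page a few times first.
    """
    batch, next_token = fetch_page(token)
    attempts = 0
    while next_token is None and attempts < retries:
        batch, next_token = fetch_page(token)
        attempts += 1
    return batch, next_token
```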
@adilosa @JoMingyu please try capturing
I also observed different error messages from other users; I believe Google's API is currently not working 100% correctly.
I'll give it a try. That makes sense. I'm sorry, but I don't have a lot of time to spend on it. However, I'll do my best to work on it.
I'm not an expert on this, but I found something weird. If I change the country and language, I get more reviews; maybe something changed on Google's side? I made this code, and I get more reviews than if I fix the country and language.
Any suggestions?
This is expected: the API returns reviews from a single country and in a single language (the default is US and English).
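If you deliberately want reviews from several locales, you can merge per-locale runs and de-duplicate on `reviewId` (a field present in the review dicts the library returns). A sketch with a deferred import and an illustrative locale list:

```python
def reviews_across_locales(app_id, locales=(("us", "en"), ("gb", "en"), ("de", "de"))):
    # Requires: pip install google-play-scraper
    from google_play_scraper import Sort, reviews_all

    seen, merged = set(), []
    for country, lang in locales:
        for review in reviews_all(app_id, country=country, lang=lang, sort=Sort.NEWEST):
            # The same review can surface in several locales; keep one copy.
            if review["reviewId"] not in seen:
                seen.add(review["reviewId"])
                merged.append(review)
    return merged
```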
@JoMingyu could it be that it scrapes until the count can no longer be divided by 200, and that is what makes this happen?
Hi all! I found this script on this issue. Unfortunately, I don't yet have enough knowledge to run it. https://github.com/scrapehero-code/google-play-review-scraper/blob/main/scraper.py
Not a good solution.
I used
Thanks man, I've tried this; the latest version is somehow not working, but the old version works fine. This worked with version 0.1.2.
Thank you so much to all contributors on this thread and @asornbor for the summary. It worked for me using Google Colab and the latest version of
To resolve this issue, simply retry the operation without exiting when a PlayGatewayError is returned. |
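A generic way to express that retry, with the exception type injected since the exact class the library raises here may vary by version (the helper name and defaults are illustrative):

```python
import time


def call_with_retries(fn, retries=5, delay=0.0, exceptions=(Exception,)):
    """Re-invoke fn on failure instead of exiting; re-raise after `retries` attempts."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```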
Could you guys try 1.2.7? I released #216 for now.
https://play.google.com/store/apps/details?id=redmasiva.bibliachat&hl=en_IN&pli=1 returns only 124 items.
Library version
1.2.6
Describe the bug
I cannot download all the reviews of an app with a large number of reviews. The number of downloaded reviews is always a multiple of 199.
Code
Expected behavior
Expect to download all the reviews with reviews_all, which should be at least 20k.
Additional context
No