Fix Reality Lovers scraper #1538

Merged (1 commit), Jan 2, 2024

Conversation

@toshski toshski commented Dec 8, 2023

The RealityLovers API will now only accept requests in batches of 12.
Previously the scraper requested 3000 scenes in a single call; it now loops through multiple pages.
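
For illustration only, the change from one large request to a page loop could look roughly like the sketch below. It is not the scraper's actual code: `iter_scenes`, `fetch_page` and `fake_fetch` are hypothetical names, the real request format is not shown in this PR, and the stop condition (a page shorter than the batch size) is an assumption.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 12  # batch size the API accepts, as stated in this PR


def iter_scenes(fetch_page: Callable[[int, int], List[Dict]],
                page_size: int = PAGE_SIZE,
                max_pages: int = 1000) -> Iterator[Dict]:
    """Yield scenes one page at a time instead of asking for everything at once.

    fetch_page(page, page_size) is a hypothetical callable that returns the
    list of scenes for one page. Iteration stops at the first page that comes
    back shorter than page_size (assumed to be the last page).
    """
    for page in range(max_pages):
        scenes = fetch_page(page, page_size)
        yield from scenes
        if len(scenes) < page_size:
            break


def fake_fetch(page: int, size: int) -> List[Dict]:
    """Stand-in for the real API call: serves a 30-scene catalogue in pages."""
    catalogue = [{"id": i} for i in range(30)]
    return catalogue[page * size:(page + 1) * size]


if __name__ == "__main__":
    print(sum(1 for _ in iter_scenes(fake_fetch)))
```

Passing the fetch function in as a parameter keeps the loop testable without touching the site.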

Note: this has been working over the past week. Initially, pages 22-29 would fail with access denied; after visiting the corresponding pages through the website they would start to work, although it could take 1-3 tries before they returned scenes. Hopefully the site is working consistently now.

toshski commented Dec 19, 2023

The site still misbehaves. Many of the content page calls now also return Not Found, e.g. engine.realitylovers.com/content/videoDetail?contentId=177734019, until you visit the page manually; after that they are fine.

We may have to change the whole scraper from using JSON via engine.realitylovers.com calls to scraping HTML from the scene pages. However, I am not sure that will even work.

I would hold merging this for now.

toshski commented Dec 30, 2023

This may be as good as it gets. The site has become very inconsistent in its ability to serve content. I thought that calling the pages rather than the API might work, but even just browsing to the pages has problems.

E.g. 10 minutes ago page 5 of the scene list worked; now it has an issue, which you can also see in the console debugger log as a 404 from the call to engine.realitylovers.com. In the browser you see the page, but with no scenes. Other pages, before and after page 5, still work. 20 minutes later it's OK again.
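
One way a scraper could cope with this kind of intermittent failure is to skip a page that errors, carry on with the rest, and remember the skipped pages for a later run. The sketch below is only an illustration of that idea, not what this PR implements: `PageUnavailable`, `scrape_tolerant` and `flaky_fetch` are hypothetical names, and the simulated failure on page 5 simply mirrors the example above.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class PageUnavailable(Exception):
    """Hypothetical error for a page request that comes back 404 or access denied."""


def scrape_tolerant(fetch_page: Callable[[int], List[Dict]],
                    pages: Iterable[int]) -> Tuple[List[Dict], List[int]]:
    """Collect scenes from every page that works and list the pages that failed.

    A failing page does not abort the run; its number is recorded so the
    caller can try it again on a later run.
    """
    scenes: List[Dict] = []
    failed: List[int] = []
    for page in pages:
        try:
            scenes.extend(fetch_page(page))
        except PageUnavailable:
            failed.append(page)
    return scenes, failed


def flaky_fetch(page: int) -> List[Dict]:
    """Stand-in for the real call: page 5 fails, every other page has 12 scenes."""
    if page == 5:
        raise PageUnavailable(f"page {page} returned 404")
    return [{"page": page, "index": i} for i in range(12)]


if __name__ == "__main__":
    found, retry_later = scrape_tolerant(flaky_fetch, range(1, 8))
    print(len(found), retry_later)
```

This suits incremental library updates, where a missed page can be picked up next time, better than a one-shot copy of the whole site.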

I don't think it is an anti-scraping measure: it affects normal browsing too, and using a VPN does not clear the issue.

This mod fixes the issue in the existing release, where calling with max=3000 always fails. I've been keeping my library up to date with the mod for the last few weeks with no issues. But you would have major difficulties if you wanted to scrape a complete copy of the site.

I'm not sure it will get any better: the site is the problem, and it hasn't gone away in the last 3 weeks. Scraping the web pages instead of making API calls to engine.realitylovers.com won't work, as I can see the same issues there.

@crwxaj crwxaj merged commit 50e733a into xbapps:master Jan 2, 2024
1 check passed
@toshski toshski deleted the Fix_RealityLovers_Scraper branch January 3, 2024 05:15