
Ability To Limit the Scraping Rates by Site & Scraper ie Domain #1665

Merged: 2 commits into xbapps:master from Scraper_Rate_Limits, Apr 2, 2024

Conversation

@toshski (Contributor) commented Mar 21, 2024

Fixes #1626

The change allows you to control the rate at which pages are scraped from a site. This is most useful when first scraping sites such as Naughty America or VRPorn, which have DoS-attack protections, but it may also be useful for day-to-day use if you have a large number of custom sites on VRPorn.

For most scrapers, rate limiting uses the rate-limiting options built into colly collectors. However, this does not work for scrapers that handle multiple sites/studios: each studio/site has its own collector, and each collector would have its own separate rate limit, so in these cases the rate needs to be managed across the multiple colly collectors. For those scrapers, calls to the colly Visit method are passed through another method first to manage the flow of requests. Scrapers that do not deal with multiple studios require no changes to implement rate limits; that is handled in the common CreateCollector function they all use.

If no rate limits are set, a scraper performs as it currently does. The rate limits are stored in the kvs table under the key "scraper_rate_limits"; the suggested values for Naughty America and VRPorn are:

```json
{
  "sites": [
    { "name": "www.naughtyamerica.com", "mindelay": 1000, "maxdelay": 2500 },
    { "name": "vrporn.com", "mindelay": 4000, "maxdelay": 7500 }
  ]
}
```

For users not comfortable modifying the database, the config can be loaded via a bundle; they will need to turn on Include Config Settings, which is off by default. A copy of a bundle with the suggested values is attached.

FYI: scenes from VRPorn can be scraped faster than these settings allow; however, scraping the actor details seems to be a lot more sensitive.

People can add other sites if required; these are just the two I know of and have tested the settings with.
xbvr-content-bundle (Rate Limits).json

@toshski (Contributor, Author) commented Mar 26, 2024

I need to refactor this: apparently Go maps are not safe for concurrent access from multiple goroutines. This can lead to the mutex used not always working, and very occasionally it may try to unlock an unlocked mutex, which will cause a panic.

@toshski (Contributor, Author) commented Mar 26, 2024

I have refactored the Scraper_Rate_Limits code to use slices instead of a Go map, which apparently is not guaranteed to be safe for concurrent use.
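A sketch of what that refactor might look like, assuming a linear scan over a mutex-guarded slice of per-site limits (the identifiers here are hypothetical, not the PR's actual ones):

```go
package main

import (
	"fmt"
	"sync"
)

// siteLimit holds one site's delay settings, in milliseconds.
type siteLimit struct {
	name     string
	minDelay int
	maxDelay int
}

// limitRegistry replaces the map: lookups scan the slice under a mutex.
// The list of sites is small, so the linear scan is cheap.
type limitRegistry struct {
	mu    sync.Mutex
	sites []siteLimit
}

func (r *limitRegistry) find(name string) (siteLimit, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, s := range r.sites {
		if s.name == name {
			return s, true
		}
	}
	return siteLimit{}, false
}

func main() {
	r := &limitRegistry{sites: []siteLimit{
		{"www.naughtyamerica.com", 1000, 2500},
		{"vrporn.com", 4000, 7500},
	}}
	if s, ok := r.find("vrporn.com"); ok {
		fmt.Println(s.minDelay, s.maxDelay) // 4000 7500
	}
}
```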

@crwxaj crwxaj merged commit dfc27a4 into xbapps:master Apr 2, 2024
1 check passed
@toshski toshski deleted the Scraper_Rate_Limits branch May 31, 2024 18:37
@pops64 pops64 mentioned this pull request Sep 15, 2024