Ability To Limit the Scraping Rates by Site & Scraper ie Domain #1665
Fixes #1626
The change allows you to control the rate at which pages are scraped from a site. This is most useful when first scraping sites such as Naughty America or VRPorn, which have DoS attack prevention, although it may also be useful for day-to-day use if you have a large number of custom sites for VRPorn.
For most scrapers, rate limiting uses the rate-limiting options built into colly collectors; this is handled in the common CreateCollector function they all use, so those scrapers need no changes. However, this does not work for scrapers that handle multiple sites/studios: each studio/site has its own collector, and each would have its own separate rate limit, so in these cases the rate needs to be managed across multiple colly collectors. For those scrapers, calls to the colly Visit method are passed to another method first to manage the flow of calls.
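For the single-site scrapers, colly's built-in LimitRule can express a randomized per-domain delay directly. A minimal sketch, assuming mindelay maps onto colly's Delay and the mindelay-to-maxdelay spread onto RandomDelay (colly waits Delay plus a random duration of up to RandomDelay between requests); this is an illustration, not the PR's actual CreateCollector code:

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// vrporn.com suggested limits: mindelay 4000 ms, maxdelay 7500 ms.
	err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*vrporn.com*",
		Delay:       4000 * time.Millisecond, // minimum wait between requests
		RandomDelay: 3500 * time.Millisecond, // extra random wait on top of Delay
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

For the multi-site scrapers, one way to manage the flow across several collectors is a shared per-domain limiter that every Visit call goes through first. The sketch below is hypothetical (domainLimiter and visit are illustrative names, not taken from the PR):

```go
package scrape

import (
	"math/rand"
	"net/url"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
)

// domainLimiter spaces out requests to the same domain even when they
// come from different colly collectors.
type domainLimiter struct {
	mu        sync.Mutex
	lastVisit map[string]time.Time
	limits    map[string]limit // keyed by host, e.g. "vrporn.com"
}

type limit struct {
	min, max time.Duration
}

func newDomainLimiter(limits map[string]limit) *domainLimiter {
	return &domainLimiter{lastVisit: map[string]time.Time{}, limits: limits}
}

// visit waits out the per-domain delay, then delegates to c.Visit.
func (d *domainLimiter) visit(c *colly.Collector, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	if l, ok := d.limits[u.Host]; ok {
		delay := l.min
		if l.max > l.min {
			delay += time.Duration(rand.Int63n(int64(l.max - l.min)))
		}
		// Reserve the next slot under the lock, then sleep outside it so
		// requests to other domains are not blocked while this one waits.
		d.mu.Lock()
		next := d.lastVisit[u.Host].Add(delay)
		if now := time.Now(); next.Before(now) {
			next = now
		}
		d.lastVisit[u.Host] = next
		d.mu.Unlock()
		time.Sleep(time.Until(next))
	}
	return c.Visit(rawURL)
}
```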
If no rate limits are set, a scraper performs as it currently does. The rate limits are stored in the kvs table under the key "scraper_rate_limits"; the suggested values for Naughty America and VRPorn are:
{ "sites": [ { "name": "www.naughtyamerica.com", "mindelay": 1000, "maxdelay": 2500 }, { "name": "vrporn.com", "mindelay": 4000, "maxdelay": 7500 } ] }
For users not comfortable modifying the database, the config can be loaded via a bundle; they will need to turn on Include Config Settings, which is off by default. A copy of a bundle with the suggested values is attached.
FYI: Scenes from VRPorn can be scraped faster than these settings allow; however, scraping the actor details seems to be a lot more sensitive.
People can add other sites if required; these are just the two I know of and tested the settings with.
xbvr-content-bundle (Rate Limits).json