allow the user to customize crawler settings #738
Conversation
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@Qiragg After this change you can specify the crawler parameters in the settings.
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@Qiragg: You can follow the same process as before to review/approve. Once I see your approval, I will reach out to the product team to get their concurrence before I merge. I don't expect this to be an issue since all the new parameters are optional with valid defaults.
@hmtbr Can you please provide a rationale/example for why this is needed? Under what conditions would the user of this module need to control these settings, and what are the drawbacks of exposing them? Thanks
@@ -85,6 +85,15 @@ def async_crawl(
     disallow_mime_types: Collection[str] = (),
     depth_limit: int = -1,
     download_limit: int = -1,
+    concurrent_requests: int = 20,
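For context, a minimal sketch of the kind of translation a connector like this typically does internally, mapping optional keyword args onto Scrapy setting overrides. Only concurrent_requests appears in the diff; the function name, the download_delay parameter, and the defaults shown here are assumptions (the defaults follow Scrapy's documented values):

```python
from typing import Any

def build_scrapy_settings(
    concurrent_requests: int = 16,  # Scrapy's documented default
    download_delay: float = 0.0,    # Scrapy's documented default
    depth_limit: int = -1,          # -1 meaning "no limit", matching the existing args
) -> dict[str, Any]:
    """Translate connector-level keyword args into Scrapy setting overrides."""
    return {
        "CONCURRENT_REQUESTS": concurrent_requests,
        "DOWNLOAD_DELAY": download_delay,
        # Scrapy treats DEPTH_LIMIT=0 as unlimited, so map -1 to 0.
        "DEPTH_LIMIT": 0 if depth_limit < 0 else depth_limit,
    }
```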
@hmtbr How did we come up with a default of 20? Also, some rationale for the other default values would be helpful. Thanks
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@touma-I The settings I added in this PR control how frequently the data-prep-connector sends requests to the targeted website. Some websites cannot handle, for example, 10 concurrent requests; in that case we want to set a lower concurrency so that we don't trigger server-side denials such as 403 or 429 responses. Conversely, if a website is large and well provisioned, we might want a higher concurrency. That's why we want to let the user customize these settings. A possible drawback is that a user could set overly aggressive values and effectively attack a website, but that is the user's responsibility, so it should not be a problem. I also updated the default values to follow the Scrapy defaults, which should be reasonable. Can you please review this again? Thanks.
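To make the trade-off concrete, a hedged usage sketch of throttling a crawl against a small site. The import path, seed-URL argument, and callback are assumptions; concurrent_requests, depth_limit, and download_limit come from the diff above:

```python
import asyncio

from dpk_connector import async_crawl  # assumed import path; only async_crawl appears in the diff

def on_downloaded(url: str, body: bytes) -> None:
    # Hypothetical callback; the connector's real callback signature may differ.
    print(f"fetched {url} ({len(body)} bytes)")

# A small site that returns 429s under load: throttle well below the default.
asyncio.run(
    async_crawl(
        ["https://example.com/"],
        on_downloaded,
        concurrent_requests=2,  # far below the original default of 20 to stay polite
        depth_limit=3,
        download_limit=100,
    )
)
```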
Thanks @hmtbr, this looks good. @Qiragg, can you follow the same process as before to review/approve?
@touma-I Matsubara-san explained the case for having these additional arguments well. I approve this PR.
Why are these changes needed?
This change will allow the user of the data-prep-connector to customize the crawler settings.
Crawler parameters such as concurrent_requests will be exposed as optional args; appropriate values depend on the target website (see the Scrapy defaults reference below).
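For reference, the Scrapy defaults the thread mentions aligning with. The values are taken from Scrapy's documentation; which subset the PR actually exposes is not fully visible in this excerpt:

```python
# Scrapy's documented defaults for the throttling-related settings.
SCRAPY_THROTTLING_DEFAULTS = {
    "CONCURRENT_REQUESTS": 16,            # global concurrency cap
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # per-domain concurrency cap
    "DOWNLOAD_DELAY": 0,                  # seconds between requests to the same domain
}
```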
Related issue number (if any).
#737