
allow the user to customize crawler settings #738

Merged: 4 commits into dev on Oct 31, 2024
Conversation

@hmtbr (Collaborator) commented on Oct 24, 2024

Why are these changes needed?

This change will allow the user of the data-prep-connector to customize the crawler settings.
The following parameters will be exposed as args:

  • concurrent_requests
  • concurrent_requests_per_domain
  • download_delay
  • randomize_download_delay
  • download_timeout
  • autothrottle_enabled
  • autothrottle_max_delay
  • autothrottle_target_concurrency
  • robots_max_crawl_delay

They can be tuned to suit the target websites; see the usage sketch below.
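For illustration, a call tuned for a small site might look like the sketch below. Only the keyword arguments listed above are part of this change; the import path, seed-URL parameter, and callback shape are illustrative assumptions.

```python
# Illustrative sketch only: the import path, seed URLs, and callback shape
# are assumptions; only the tuning keyword arguments come from this change.
from dpk_connector import async_crawl  # assumed import path

def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    # Hypothetical per-page callback: process each fetched page.
    print(f"fetched {url}: {len(body)} bytes")

async_crawl(
    ["https://example.com/"],  # assumed seed-URL parameter
    on_downloaded,             # assumed callback parameter
    # Conservative profile for a small site:
    concurrent_requests=2,
    concurrent_requests_per_domain=1,
    download_delay=1.0,
    randomize_download_delay=True,
    download_timeout=60,
    autothrottle_enabled=True,
    autothrottle_max_delay=30.0,
    autothrottle_target_concurrency=1.0,
    robots_max_crawl_delay=10,
)
```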

Related issue number (if any).

#737

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@hmtbr marked this pull request as ready for review on October 24, 2024 07:06
@hmtbr requested a review from touma-I on October 24, 2024 07:07
@hmtbr (Collaborator, Author) commented on Oct 24, 2024

@Qiragg After this change, you can specify the crawler parameters in the settings.

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@touma-I (Collaborator) commented on Oct 24, 2024

@Qiragg: You can follow the same process as before to review/approve. Once I see your approval, I will reach out to the product team to get their concurrence before I merge. I don't expect this to be an issue since all the new parameters are optional with valid defaults.

@touma-I (Collaborator) commented on Oct 24, 2024

@hmtbr Can you please provide a rationale or an example for why this is needed? Under what conditions would the user of this module need to control these settings, and what are the drawbacks of exposing them? Thanks

```diff
@@ -85,6 +85,15 @@ def async_crawl(
     disallow_mime_types: Collection[str] = (),
     depth_limit: int = -1,
     download_limit: int = -1,
+    concurrent_requests: int = 20,
```
@touma-I (Collaborator) commented on Oct 24, 2024

@hmtbr How did we come up with a default of 20? Also, some rationale for the other default values would be helpful as well. Thanks

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@hmtbr (Collaborator, Author) commented on Oct 28, 2024

@touma-I The settings added in this PR control how frequently the data-prep-connector sends requests to the target website. A website sometimes doesn't have the capacity to handle, say, 10 concurrent requests. In that case we want to set a lower concurrency so that we don't trigger server-side denials such as 403 or 429 responses. On the other hand, if a website is very large and robust, we might want to set a higher concurrency. That's why we want to let the user customize these settings.

A possible drawback is that a user could set overly aggressive values and effectively mount an attack on a website, but that responsibility lies with the user, so it should not be a problem.

I updated the default values to follow the Scrapy defaults, which should be reasonable:
https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests

Can you please review this again? Thanks.
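For reference, these arguments correspond one-to-one to standard Scrapy settings (robots_max_crawl_delay is connector-specific and omitted here). Below is a minimal sketch of such a mapping, assuming the connector simply forwards the values; the function is illustrative, not this PR's actual code. Setting names and defaults are from the Scrapy settings documentation linked above.

```python
# Minimal sketch, assuming the connector forwards its keyword arguments to
# Scrapy settings. The function is illustrative, not the PR's actual code.
def build_scrapy_settings(
    concurrent_requests: int = 16,                 # Scrapy default: 16
    concurrent_requests_per_domain: int = 8,       # Scrapy default: 8
    download_delay: float = 0,                     # Scrapy default: 0
    randomize_download_delay: bool = True,         # Scrapy default: True
    download_timeout: int = 180,                   # Scrapy default: 180 s
    autothrottle_enabled: bool = False,            # Scrapy default: False
    autothrottle_max_delay: float = 60.0,          # Scrapy default: 60.0
    autothrottle_target_concurrency: float = 1.0,  # Scrapy default: 1.0
) -> dict:
    return {
        "CONCURRENT_REQUESTS": concurrent_requests,
        "CONCURRENT_REQUESTS_PER_DOMAIN": concurrent_requests_per_domain,
        "DOWNLOAD_DELAY": download_delay,
        "RANDOMIZE_DOWNLOAD_DELAY": randomize_download_delay,
        "DOWNLOAD_TIMEOUT": download_timeout,
        "AUTOTHROTTLE_ENABLED": autothrottle_enabled,
        "AUTOTHROTTLE_MAX_DELAY": autothrottle_max_delay,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": autothrottle_target_concurrency,
    }

# Example: a polite profile for a small site that returns 429s under load,
# using low concurrency, an explicit delay, and autothrottle enabled.
polite = build_scrapy_settings(
    concurrent_requests=2,
    concurrent_requests_per_domain=1,
    download_delay=1.0,
    autothrottle_enabled=True,
)
```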

@touma-I (Collaborator) commented on Oct 28, 2024

> The settings added in this PR control how frequently the data-prep-connector sends requests to the target website. […]

Thanks @hmtbr, this looks good. @Qiragg, can you follow the same process as before to review/approve?

@Qiragg commented on Oct 29, 2024

> @Qiragg: You can follow the same process as before to review/approve. […]

@touma-I Matsubara-san explained the case for having these additional arguments well.

I approve this PR.

@hmtbr merged commit a725112 into dev on Oct 31, 2024
5 checks passed
@hmtbr deleted the connector-settings branch on October 31, 2024 00:08
@hmtbr mentioned this pull request on Nov 4, 2024