
allow the user to customize crawler settings #738

Merged: 4 commits into dev on Oct 31, 2024
Conversation

@hmtbr (Collaborator) commented on Oct 24, 2024

Why are these changes needed?

This change will allow the user of the data-prep-connector to customize the crawler settings.
The following parameters will be exposed as args:

  • concurrent_requests
  • concurrent_requests_per_domain
  • download_delay
  • randomize_download_delay
  • download_timeout
  • autothrottle_enabled
  • autothrottle_max_delay
  • autothrottle_target_concurrency
  • robots_max_crawl_delay

They can be tuned to suit the target websites; see the usage sketch below.
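For illustration, a call tuned for a small site might look like the sketch below. Only the keyword arguments listed above are part of this change; the import path, seed-URL parameter, and callback shape are illustrative assumptions.

```python
# Illustrative sketch only: the import path, seed URLs, and callback shape
# are assumptions; only the tuning keyword arguments come from this change.
from dpk_connector import async_crawl  # assumed import path

def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    # Hypothetical per-page callback: process each fetched page.
    print(f"fetched {url}: {len(body)} bytes")

async_crawl(
    ["https://example.com/"],  # assumed seed-URL parameter
    on_downloaded,             # assumed callback parameter
    # Conservative profile for a small site:
    concurrent_requests=2,
    concurrent_requests_per_domain=1,
    download_delay=1.0,
    randomize_download_delay=True,
    download_timeout=60,
    autothrottle_enabled=True,
    autothrottle_max_delay=30.0,
    autothrottle_target_concurrency=1.0,
    robots_max_crawl_delay=10,
)
```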

Related issue number (if any).

#737

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@hmtbr marked this pull request as ready for review on October 24, 2024 07:06
@hmtbr requested a review from touma-I on October 24, 2024 07:07
@hmtbr (Collaborator, Author) commented on Oct 24, 2024

@Qiragg After this change, you can specify the crawler parameters in the settings.

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@touma-I (Collaborator) commented on Oct 24, 2024

@Qiragg: You can follow the same process as before to review/approve. Once I see your approval, I will reach out to the product team to get their concurrence before I merge. I don't expect this to be an issue since all the new parameters are optional with valid defaults.

@touma-I (Collaborator) commented on Oct 24, 2024

@hmtbr Can you please provide a rationale or an example for why this is needed? Under what conditions would the user of this module need to control these settings, and what are the drawbacks of exposing them? Thanks

```diff
@@ -85,6 +85,15 @@ def async_crawl(
     disallow_mime_types: Collection[str] = (),
     depth_limit: int = -1,
     download_limit: int = -1,
+    concurrent_requests: int = 20,
```
@touma-I (Collaborator) commented on Oct 24, 2024

@hmtbr How did we come up with a default of 20? Also, some rationale for the other default values would be helpful as well. Thanks

Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
Signed-off-by: Hiroya Matsubara <hmtbr@jp.ibm.com>
@hmtbr (Collaborator, Author) commented on Oct 28, 2024

@touma-I The settings added in this PR control how frequently the data-prep-connector sends requests to the target website. A website sometimes doesn't have the capacity to handle, say, 10 concurrent requests. In that case we want to set a lower concurrency so that we don't trigger server-side denials such as 403 or 429 responses. On the other hand, if a website is very large and robust, we might want to set a higher concurrency. That's why we want to let the user customize these settings.

A possible drawback is that a user could set overly aggressive values and effectively mount an attack on a website, but that responsibility lies with the user, so it should not be a problem.

I updated the default values to follow the Scrapy defaults, which should be reasonable:
https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests

Can you please review this again? Thanks.
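For reference, these arguments correspond one-to-one to standard Scrapy settings (robots_max_crawl_delay is connector-specific and omitted here). Below is a minimal sketch of such a mapping, assuming the connector simply forwards the values; the function is illustrative, not this PR's actual code. Setting names and defaults are from the Scrapy settings documentation linked above.

```python
# Minimal sketch, assuming the connector forwards its keyword arguments to
# Scrapy settings. The function is illustrative, not the PR's actual code.
def build_scrapy_settings(
    concurrent_requests: int = 16,                 # Scrapy default: 16
    concurrent_requests_per_domain: int = 8,       # Scrapy default: 8
    download_delay: float = 0,                     # Scrapy default: 0
    randomize_download_delay: bool = True,         # Scrapy default: True
    download_timeout: int = 180,                   # Scrapy default: 180 s
    autothrottle_enabled: bool = False,            # Scrapy default: False
    autothrottle_max_delay: float = 60.0,          # Scrapy default: 60.0
    autothrottle_target_concurrency: float = 1.0,  # Scrapy default: 1.0
) -> dict:
    return {
        "CONCURRENT_REQUESTS": concurrent_requests,
        "CONCURRENT_REQUESTS_PER_DOMAIN": concurrent_requests_per_domain,
        "DOWNLOAD_DELAY": download_delay,
        "RANDOMIZE_DOWNLOAD_DELAY": randomize_download_delay,
        "DOWNLOAD_TIMEOUT": download_timeout,
        "AUTOTHROTTLE_ENABLED": autothrottle_enabled,
        "AUTOTHROTTLE_MAX_DELAY": autothrottle_max_delay,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": autothrottle_target_concurrency,
    }

# Example: a polite profile for a small site that returns 429s under load,
# using low concurrency, an explicit delay, and autothrottle enabled.
polite = build_scrapy_settings(
    concurrent_requests=2,
    concurrent_requests_per_domain=1,
    download_delay=1.0,
    autothrottle_enabled=True,
)
```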

@touma-I (Collaborator) commented on Oct 28, 2024

> The settings added in this PR control how frequently the data-prep-connector sends requests to the target website. […]

Thanks @hmtbr, this looks good. @Qiragg, can you follow the same process as before to review/approve?

@Qiragg commented on Oct 29, 2024

> @Qiragg: You can follow the same process as before to review/approve. […]

@touma-I Matsubara-san explained the case for having these additional arguments well.

I approve this PR.

@hmtbr merged commit a725112 into dev on Oct 31, 2024
5 checks passed
@hmtbr deleted the connector-settings branch on October 31, 2024 00:08
@hmtbr mentioned this pull request on Nov 4, 2024