Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: better newUrlFunction for ProxyConfiguration #2392

Merged
merged 12 commits into from
Apr 4, 2024

Conversation

barjin
Copy link
Contributor

@barjin barjin commented Mar 26, 2024

Based on changes from #2348 , this PR simplifies the proxy handling in the browser crawlers and makes those more intuitive.

Ready for reviews!

Closes #2065

@barjin barjin self-assigned this Mar 26, 2024
@github-actions github-actions bot added this to the 86th sprint - Tooling team milestone Mar 26, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 26, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 28, 2024
@barjin barjin marked this pull request as ready for review March 28, 2024 15:34
@barjin
Copy link
Contributor Author

barjin commented Mar 28, 2024

Also cc #2310 - this solves the issue from there - but only for proxyUrls.

Copy link
Member

@vladfrangu vladfrangu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me otherwise 👀

packages/core/src/proxy_configuration.ts Outdated Show resolved Hide resolved
Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few nits/ideas

packages/browser-pool/src/browser-pool.ts Outdated Show resolved Hide resolved
packages/browser-pool/src/browser-pool.ts Outdated Show resolved Hide resolved
packages/core/src/proxy_configuration.ts Outdated Show resolved Hide resolved
@barjin
Copy link
Contributor Author

barjin commented Apr 3, 2024

Looking at apify-sdk's ProxyConfiguration, the implementation there should be quite straightforward once we merge this - we inherit Crawlee's ProxyConfiguration class and can call the protected _handleTieredUrl() method with all the logic from there. Any ideas abt how we can make it even more seamless now here?

@barjin barjin merged commit 330598b into master Apr 4, 2024
8 checks passed
@barjin barjin deleted the feat/proxy-per-browser branch April 4, 2024 09:57
@harshmaur
Copy link

@barjin documentation for this ? How do we configure it?

@barjin
Copy link
Contributor Author

barjin commented Apr 8, 2024

@harshmaur - keep in mind that this feature is not yet in the stable version of Crawlee (will be included in the next release). Right now, you can try it out by installing the next tag (npm install crawlee@next).

These are mostly internal changes to how we manage proxies in browser-based crawlers, but the external API changed a bit too. See the JSDoc for newUrlFunction - you can now return null from the function if you want the crawler to not use proxy for this request. Also, the newUrlFunction is now called with (optional) request object. All in all, you can do something like this:

...
newUrlFunction: (sessionId, { request } = {}) {
   if (request?.url.includes('example.com')) return null;
   return 'default.proxy.com';
}
...

Let us know if you encounter any issues with this. Cheers!

@harshmaur
Copy link

harshmaur commented Apr 8, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to turn off proxy based on request url
4 participants