Commit
feat: adaptive playwright crawler (#2316)
This uses the newly added restricted crawling contexts to execute request handlers. This allows us to compare browser and HTTP-only request handler runs for a request and switch to HTTP-only crawling on sites that we predict to be static.

More information will be added here.

The intended usage is as follows:

```ts
import { AdaptivePlaywrightCrawler } from 'crawlee';

const startUrls = [{ url: 'https://warehouse-theme-metal.myshopify.com/collections', label: 'START' }];

const crawler = new AdaptivePlaywrightCrawler({
    requestHandler: async ({ request, enqueueLinks, pushData, querySelector }) => {
        console.log(`Processing: ${request.url} (${request.label})`);

        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = (await querySelector('.product-meta h1')).text();
            const sku = (await querySelector('span.product-meta__sku-number')).text();

            const $prices = await querySelector('span.price');
            const currentPriceString = $prices.filter(':contains("$")').first().text();
            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElements = await querySelector('span.product-form__inventory');
            const inStock = inStockElements.filter(':contains("In stock")').length > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await pushData(results);
        } else if (request.label === 'CATEGORY') {
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            await enqueueLinks({
                selector: 'a.pagination__next',
                label: 'CATEGORY', // <= note the same label
            });
        } else if (request.label === 'START') {
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 100,
    maxRequestRetries: 0,
    minConcurrency: 1,
    maxConcurrency: 1,
    headless: true,
});

await crawler.run(startUrls);
```

When handling a request from the queue, the crawler

1. tries to predict the rendering type (static/client only) based on URL, label and potentially other criteria (using a logistic regression model that gets updated on the fly)
2. for static pages, an HTTP-only scrape is done and the request handler works with Cheerio-based portadom
3. for client only pages, a Playwright scrape is done and the request handler receives a portadom instance that uses Playwright locators (hence it waits for content to appear implicitly)
4. for a configurable percentage of requests (and also when we're not confident about the prediction), a detection is done: both HTTP-only and Playwright scrapes are done and the results are compared. If (and only if) the HTTP-only scrape behaves the same, we conclude the page is static and update our logistic regression model.
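The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: the `Prediction` shape, the `confidenceThreshold` parameter, the `chooseStrategy` and `detectRenderingType` helpers, and the JSON-based comparison are all hypothetical names and choices made for the example. The logistic regression model itself is omitted; only its output is modelled.

```ts
// Assumed values for the rendering type; the commit text only says "static/client only".
type RenderingType = 'static' | 'clientOnly';

// Hypothetical shape of a prediction produced by the model.
interface Prediction {
    renderingType: RenderingType;
    confidence: number; // assumed to lie in [0, 1]
}

type Strategy = 'httpOnly' | 'browser' | 'detection';

// Steps 1-4: decide how to process a single request.
function chooseStrategy(
    prediction: Prediction,
    detectionRatio: number, // plays the role of renderingTypeDetectionRatio
    confidenceThreshold = 0.8, // hypothetical cut-off, not taken from the commit
    random: () => number = Math.random, // injectable so the sketch is testable
): Strategy {
    // Step 4: run a detection for a sample of requests, or when unsure.
    if (prediction.confidence < confidenceThreshold || random() < detectionRatio) {
        return 'detection';
    }
    // Steps 2 and 3: trust the prediction.
    return prediction.renderingType === 'static' ? 'httpOnly' : 'browser';
}

// Step 4, continued: a page counts as static only if both runs produced the same result.
// JSON.stringify is a simplification here and is sensitive to key order.
function detectRenderingType<T>(httpResult: T, browserResult: T): RenderingType {
    return JSON.stringify(httpResult) === JSON.stringify(browserResult) ? 'static' : 'clientOnly';
}
```

In a real crawler the outcome of `detectRenderingType` would be fed back into the model as a new training example; how that update happens is not shown in this commit message, so it is left out here.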
Showing 17 changed files with 1,025 additions and 25 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,7 +1,9 @@ |
| | | export * from '@crawlee/browser'; |
| | | export * from './internals/playwright-crawler'; |
| | | export * from './internals/playwright-launcher'; |
| | | export * from './internals/adaptive-playwright-crawler'; |
| | | |
| | | export * as playwrightUtils from './internals/utils/playwright-utils'; |
| | | export * as playwrightClickElements from './internals/enqueue-links/click-elements'; |
| | | export type { DirectNavigationOptions as PlaywrightDirectNavigationOptions } from './internals/utils/playwright-utils'; |
| | | export type { RenderingType } from './internals/utils/rendering-type-prediction'; |