
refactor: replace fakestore with warehouse-theme-metal.myshopify.com #1104

Draft · wants to merge 3 commits into master
slug: /anti-scraping/mitigation/using-proxies

---

In the [**Web scraping for beginners**](../../scraping_basics_javascript/index.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
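
Under the hood, Crawlee represents the pool with a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) object, which we'll meet below. As a minimal sketch of the rotation itself (the proxy URLs here are made-up placeholders, not working proxies), each call to `newUrl()` hands out the next proxy from the pool:

```js
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Hypothetical proxy URLs, for illustration only.
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

// With a plain list of URLs, Crawlee rotates through them in order.
console.log(await proxyConfiguration.newUrl()); // e.g. http://proxy-1.example.com:8000
console.log(await proxyConfiguration.newUrl()); // e.g. http://proxy-2.example.com:8000
console.log(await proxyConfiguration.newUrl()); // e.g. http://proxy-1.example.com:8000 again
```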

## Implementing proxies {#implementing-proxies}

Let's build on top of the code which appears at the end of the [Professional scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson of the **Web Scraping for Beginners** course.

Let's paste the same code to a new file, `proxies.js`, and make some changes. The code crawls the [Sales](https://warehouse-theme-metal.myshopify.com/collections/sales) page of a sample e-commerce website. It goes through all of the product links, enqueues requests to each page with a product detail, and scrapes data about all of the products:

```js title=proxies.js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        console.log(`Fetching URL: ${request.url}`);

        if (request.label === 'start-url') {
            await enqueueLinks({
                selector: 'a.product-item__title',
            });

            // When on the start page, we don't want to
            // extract any data after we extract the links.
            return;
        }

        // We copied and pasted the extraction code
        // from the previous lesson
        const title = $('h1').text().trim();
        const vendor = $('a.product-meta__vendor').text().trim();
        const price = $('span.price').contents()[2].nodeValue;
        const reviewCount = parseInt($('span.rating__caption').text(), 10);
        const description = $('div[class*="description"] div.rte').text().trim();

        // Instead of saving the data to a variable,
        // we immediately save everything to a file.
        await Dataset.pushData({
            title,
            vendor,
            price,
            reviewCount,
            description,
        });
    },
});

await crawler.addRequests([{
    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
    label: 'start-url',
}]);

await crawler.run();
```
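
If you run the file now (assuming `crawlee` is installed in the project, for example with `npm install crawlee`), the scraper collects all the products, so far without any proxies:

```shell
node proxies.js
```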

We'll want all the requests to go through proxies. For that, we obviously need some! To get a few, we can use Matthias Stephens' [free proxy scraper](https://apify.com/mstephen190/proxy-scraper), which can find tens of reliable proxies out of the thousands it scrapes.

Once we have a list of proxies, we can add [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and pass it to our crawler.

Proxy pools usually consist of many proxy URLs, but for the sake of simplicity, we'll list just three in this lesson. By the time you read this, they most probably won't work anymore, so be sure to use your own values.

```js
import { CheerioCrawler, Dataset, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
});
```

Finally, we can pass the `proxyConfiguration` into our crawler's options:

```js
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks }) => {
        console.log(`Fetching URL: ${request.url}`);

        if (request.label === 'start-url') {
            await enqueueLinks({
                selector: 'a.product-item__title',
            });
            return;
        }

        const title = $('h1').text().trim();
        const vendor = $('a.product-meta__vendor').text().trim();
        const price = $('span.price').contents()[2].nodeValue;
        const reviewCount = parseInt($('span.rating__caption').text(), 10);
        const description = $('div[class*="description"] div.rte').text().trim();

        await Dataset.pushData({
            title,
            vendor,
            price,
            reviewCount,
            description,
        });
    },
});

await crawler.addRequests([{
    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
    label: 'start-url',
}]);

await crawler.run();
```

The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` array.
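
To see the rotation in action, one quick check (a sketch of ours, not part of the original course code) is to point the crawler at a service which echoes the caller's IP address, such as [api.ipify.org](https://api.ipify.org), and fetch it several times. Each request should report a different address:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // api.ipify.org responds with text/plain, which CheerioCrawler
    // ignores unless we allow the MIME type explicitly.
    additionalMimeTypes: ['text/plain'],
    requestHandler: async ({ request, body }) => {
        // Each line should show a different outgoing IP address.
        console.log(`${request.url} -> ${body.toString()}`);
    },
});

// Unique keys let us enqueue the same URL several times.
await crawler.addRequests([
    { url: 'https://api.ipify.org/', uniqueKey: 'ip-1' },
    { url: 'https://api.ipify.org/', uniqueKey: 'ip-2' },
    { url: 'https://api.ipify.org/', uniqueKey: 'ip-3' },
]);

await crawler.run();
```
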
## Debugging proxies {#debugging-proxies}

To check that we're scraping through the proxies, we can read `proxyInfo` from the handler's context. It's an object which includes useful data about the proxy used to make the request.

In the code example, we already destructure the context object into `$` and `request`, so we can add `proxyInfo` as something we want to access in the handler, too.

```js
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks, proxyInfo }) => {
        // Log the proxy used for the current request.
        console.log(proxyInfo);
        // ...the rest of the handler stays the same
    },
});
```

After modifying the code to log `proxyInfo` and running the scraper, we can see details about the proxy used for each request:

![Sample logs of proxyInfo](./images/proxy-info-logs.png)

These logs confirm that Crawlee uses and automatically rotates the proxies. They can also be useful when debugging slow or broken proxies.
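
If some proxies in the pool turn out to be slow or dead, one way to surface them is to log the proxy of every request which exhausted its retries. This is a sketch using Crawlee's `failedRequestHandler`, not something the lesson itself sets up:

```js
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks, proxyInfo }) => {
        // ...same handler as before
    },
    // Called when a request has failed all of its retries.
    failedRequestHandler: async ({ request, proxyInfo }, error) => {
        console.error(`${request.url} failed via proxy ${proxyInfo?.url}: ${error.message}`);
    },
});
```
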
## Carefree proxy scraping {#higher-level-proxy-scraping}

If scraping and managing proxies on your own feels tedious, there are services which do that for you. One of them is [Apify Proxy](https://apify.com/proxy), which provides proxies with both residential and datacenter IP addresses. The integration with Crawlee is seamless, but first you need to install the Apify SDK:

```shell
npm install apify
```

Then you can create the `proxyConfiguration` like this:

```js
import { Actor } from 'apify';

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['SHADER'], // an example Apify Proxy group
});
```

Notice that we didn't provide it a list of proxy URLs; the `SHADER` group itself serves as the proxy pool (courtesy of Apify Proxy). For more information about the integration, refer to the [Apify SDK documentation](https://docs.apify.com/sdk/js/docs/guides/proxy-management).
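
For completeness, here is a sketch of how such a configuration could plug into this lesson's crawler. It assumes you have access to Apify Proxy and your credentials are available to the SDK (for example via the `APIFY_PROXY_PASSWORD` environment variable), and it reuses the `SHADER` group from the example above:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['SHADER'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
await Actor.exit();
```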

## Next up {#next}


---

Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/), a sample Shopify website.

> Most web data extraction cases involve looping through a list of items of some sort.

With Playwright:

```js
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

// code will go here

await browser.close();
```

With Puppeteer:

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

// code will go here

await browser.close();
```

We'll be returning a bunch of product objects from this function, which will then be accessible back in our Node.js context after the promise resolves:

```js
const products = await page.evaluate(() => {
    const productCards = Array.from(document.querySelectorAll('.product-item'));

    return productCards.map((element) => {
        const name = element.querySelector('.product-item__title').textContent;
        const price = element.querySelector('.price').lastChild.textContent;
        return { name, price };
    });
});

console.log(products);
```

When we run this code, we see this logged to our console:

```text
$ node index.js
[
  {
    name: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker',
    price: '$74.95'
  },
  {
    name: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV',
    price: 'From $1,398.00'
  },
  ...
]
```
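
As a side note, both Playwright and Puppeteer also offer `page.$$eval()`, a shorthand which selects all matching elements and runs a callback over them inside the browser context. The same extraction could be sketched like this:

```js
const products = await page.$$eval('.product-item', (cards) => {
    return cards.map((card) => {
        const name = card.querySelector('.product-item__title').textContent;
        const price = card.querySelector('.price').lastChild.textContent;
        return { name, price };
    });
});

console.log(products);
```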

## Using jQuery {#using-jquery}

Now, since we're able to use jQuery, let's translate our vanilla JavaScript code into jQuery:
```js
await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });

const products = await page.evaluate(() => {
    const productCards = $('.product-item');
    return productCards.map(function () {
        const card = $(this);
        const name = card.find('.product-item__title').text();
        const price = card.find('.price').contents().last().text();
        return { name, price };
    }).get();
});

console.log(products);
```

We can also load the page's HTML into [Cheerio](https://cheerio.js.org/) and parse it outside of the browser. With Playwright:

```js
import { chromium } from 'playwright';
import { load } from 'cheerio';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

const $ = load(await page.content());
```

Expand All @@ -197,7 +200,7 @@ import { load } from 'cheerio';
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://demo-webstore.apify.org/search/on-sale');
await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

const $ = load(await page.content());

Now, to loop through all of the products, we'll make use of the `$` object and loop through the product cards, just like we did in the browser:
```js
const $ = load(await page.content());

const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();

console.log(products);
```
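
One detail worth calling out: unlike `Array.prototype.map()`, Cheerio's `.map()` returns another Cheerio object, which is why the code above calls `.get()` to convert the result into a plain JavaScript array. For example:

```js
// Without .get(), `names` would be a Cheerio collection, not an array.
const names = $('.product-item__title').map(function () {
    return $(this).text();
}).get();
```
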
Our final code, with Playwright:

```js
import { chromium } from 'playwright';
import { load } from 'cheerio';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

const $ = load(await page.content());

const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();

console.log(products);

await browser.close();
```

And with Puppeteer:

```js
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales');

const $ = load(await page.content());

const productCards = $('.product-item');
const products = productCards.map(function () {
    const card = $(this);
    const name = card.find('.product-item__title').text();
    const price = card.find('.price').contents().last().text();
    return { name, price };
}).get();

console.log(products);

await browser.close();
```
