- Installing Prerequisites
- Making an HTTP GET request
- Web scraping in PHP with Goutte
- Web scraping with Symfony Panther
PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most widely used content management system for building websites, is written in PHP.
PHP provides the basic building blocks needed to write a web scraper, but doing everything by hand quickly becomes complicated. Conveniently, many open-source libraries make web scraping with PHP more accessible.
This article guides you step by step through writing several PHP web scraping routines that extract public data from static and dynamic web pages.
For a detailed explanation, see our blog post.
# Windows
choco install php
choco install composer
or
# macOS
brew install php
brew install composer
<?php
$html = file_get_contents('https://books.toscrape.com/');
echo $html;
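The raw HTML returned by `file_get_contents` still has to be parsed before any data can be extracted. As a minimal sketch of doing this with PHP's bundled DOM extension only (assuming the `dom` and `libxml` extensions are enabled), the snippet below pulls titles and prices out of an inline HTML sample. The sample markup merely imitates the `.product_pod` layout used later in this guide, and the helper names `hasClass` and `extractBooks` are hypothetical.

```php
<?php
// Hypothetical helper: returns an XPath test that matches one CSS class token.
function hasClass(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

// Hypothetical helper: parses an HTML string and returns [title, price] pairs.
function extractBooks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely strictly valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $books = [];
    foreach ($xpath->query('//*[' . hasClass('product_pod') . ']') as $node) {
        $img = $xpath->query('.//img', $node)->item(0);
        $price = $xpath->query('.//*[' . hasClass('price_color') . ']', $node)->item(0);
        if ($img === null || $price === null) {
            continue; // Skip entries that do not match the assumed layout.
        }
        $books[] = [$img->getAttribute('alt'), trim($price->textContent)];
    }
    return $books;
}

// Inline sample (an assumption) so that the sketch runs without network access.
$sample = <<<'HTML'
<html><body>
<article class="product_pod">
  <div class="image_container"><img src="one.jpg" alt="Book One"></div>
  <p class="price_color">10.00 GBP</p>
</article>
<article class="product_pod">
  <div class="image_container"><img src="two.jpg" alt="Book Two"></div>
  <p class="price_color">12.50 GBP</p>
</article>
</body></html>
HTML;

foreach (extractBooks($sample) as [$title, $price]) {
    echo $title . ' - ' . $price . PHP_EOL;
}
```

The class-token XPath expression is verbose, which is one reason the libraries covered next offer CSS selectors instead.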
composer init --no-interaction --require="php >=7.1"
composer require fabpot/goutte
composer update
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');
echo $crawler->html();
echo $crawler->filter('title')->text(); // CSS selector
echo $crawler->filterXPath('//title')->text(); // XPath expression
function scrapePage($url, $client)
{
    $crawler = $client->request('GET', $url);
    // Each book on the page is wrapped in an element with the class "product_pod".
    $crawler->filter('.product_pod')->each(function ($node) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        echo $title . "-" . $price . PHP_EOL;
    });
}
function scrapePage($url, $client, $file)
{
    //...
    // Handling Pagination
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}
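The pagination code above prepends a fixed `https://books.toscrape.com/catalogue/` prefix, which only works while the crawl stays inside that directory. As a hedged sketch of a more general approach, the hypothetical helper `resolveNextUrl` below builds the absolute URL from the current page URL and the link's `href`. It is deliberately simplified: it ignores ports, does not normalise `../` segments, and is not a full implementation of relative URL resolution.

```php
<?php
// Hypothetical helper: builds an absolute URL from the page URL and a link's href.
// Simplified on purpose: no port handling and no "../" normalisation.
function resolveNextUrl(string $currentUrl, string $href): string
{
    // An href that already carries a scheme (such as https) is returned unchanged.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $parts = parse_url($currentUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative link: append it to the origin.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Relative link: replace the last path segment of the current URL.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveNextUrl('https://books.toscrape.com/catalogue/page-1.html', 'page-2.html') . PHP_EOL;
echo resolveNextUrl('https://books.toscrape.com/', 'catalogue/page-2.html') . PHP_EOL;
```

With such a helper, `scrapePage` could return `resolveNextUrl($url, $next_page)` instead of concatenating a hard-coded prefix.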
function scrapePage($url, $client, $file)
{
    $crawler = $client->request('GET', $url);
    // Write one CSV row per book found on the page.
    $crawler->filter('.product_pod')->each(function ($node) use ($file) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        fputcsv($file, [$title, $price]);
    });
    // Return the URL of the next page, or null when the last page is reached.
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}
$client = new Client();
$file = fopen("books.csv", "a");
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";
// Keep scraping until scrapePage() reports that there is no next page.
while ($nextUrl) {
    echo "<h2>" . $nextUrl . "</h2>" . PHP_EOL;
    $nextUrl = scrapePage($nextUrl, $client, $file);
}
fclose($file);
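Because the script opens `books.csv` in append mode (`"a"`), every run adds rows to the existing file, and a header row written unconditionally would be repeated on each run. The sketch below shows one possible way to write the header only once. The helper name `openCsvWithHeader`, the column names, and the temporary file path are assumptions made for illustration.

```php
<?php
// Hypothetical helper: opens a CSV file for appending and writes the header
// row only when the file is missing or empty, so reruns do not duplicate it.
function openCsvWithHeader(string $path, array $header)
{
    // Drop cached stat data so that filesize() reflects earlier writes.
    clearstatcache(true, $path);
    $isNew = !file_exists($path) || filesize($path) === 0;

    $file = fopen($path, 'a');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    if ($isNew) {
        fputcsv($file, $header);
    }
    return $file;
}

// Demo file in the system temp directory (an assumption for this sketch).
$path = sys_get_temp_dir() . '/books_demo_' . uniqid() . '.csv';

foreach ([['Book One', '10.00'], ['Book Two', '12.50']] as $row) {
    // Reopening the file for each row imitates two separate runs of the scraper.
    $file = openCsvWithHeader($path, ['title', 'price']);
    fputcsv($file, $row);
    fclose($file);
}

echo file_get_contents($path);
```

In the scraper above, this would replace the plain `fopen("books.csv", "a")` call; the rest of the loop stays the same.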
composer init --no-interaction --require="php >=7.1"
composer require symfony/panther
composer update
brew install chromedriver
<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$client->get('https://quotes.toscrape.com/js/');
$crawler = $client->waitFor('.quote');
$crawler->filter('.quote')->each(function ($node) {
    $author = $node->filter('.author')->text();
    $quote = $node->filter('.text')->text();
    echo $author . " - " . $quote . PHP_EOL;
});
while (true) {
    $crawler = $client->waitFor('.quote');
    // …
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}
$file = fopen("quotes.csv", "a");
while (true) {
    // Wait until JavaScript has rendered the quotes before reading them.
    $crawler = $client->waitFor('.quote');
    $crawler->filter('.quote')->each(function ($node) use ($file) {
        $author = $node->filter('.author')->text();
        $quote = $node->filter('.text')->text();
        fputcsv($file, [$author, $quote]);
    });
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}
fclose($file);
If you wish to find out more about web scraping with PHP, see our blog post.