WebSpider

This library wraps Guzzle and several Symfony components to provide an easy way to spider websites.

Requirements

  • PHP >= 5.5
  • Guzzle >= 6.0
  • Doctrine ORM >= 2.2
  • Symfony Components >= 2.7

Installation

Add fievel/webspider as a required dependency in your composer.json file:

composer require fievel/webspider

Usage

Extend the WebSpiderAbstract class and implement the following methods as needed:

getDataFromResponse: extracts the data from the response. The default behaviour treats the body as plain text;

protected function getDataFromResponse(ResponseInterface $response)
{
    return (string) $response->getBody();
}

parseData: extracts the information you need from the data. You can use the Symfony DomCrawler here if needed;

protected function parseData($data)
{
    $this->crawler->addHtmlContent($data);

    $node = $this->crawler->filter('input');

    $value = null;
    if ($node->count() > 0) {
        $value = $node->first()->attr('value');
    }

    return $value;
}

handleException: handles Guzzle exceptions.

protected function handleException(\Exception $e)
{
    return null;
}
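Because the default getDataFromResponse treats the body as plain text, a spider that targets a JSON endpoint would need its own override. The following is a minimal sketch of that idea, not part of the library: decodeJsonBody is a hypothetical helper name, and it uses only PHP's built-in json_decode.

```php
<?php

/**
 * Hypothetical helper (not part of WebSpider): decode a JSON body into an
 * associative array, or return null when the body is not valid JSON.
 */
function decodeJsonBody($body)
{
    $data = json_decode($body, true);

    return json_last_error() === JSON_ERROR_NONE ? $data : null;
}

// Possible use inside a spider, with the method signature shown above:
//
// protected function getDataFromResponse(ResponseInterface $response)
// {
//     return decodeJsonBody((string) $response->getBody());
// }
```

With an override like this, parseData would receive an array instead of a string, so it would need to be adapted accordingly.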

All that remains is to launch the spider you created, which you can do with the SpiderManager service.

$manager = $this->container->get('fievel_web_spider.manager.spider');
$manager->setLogger($this->logger);

$response = null;
try {
    $response = $manager->runSpider([
        AppBundle\Spiders\CustomSpider::class,  // Spider class created
        'http://localhost/test-spider',         // URL to spider
        'post',                                 // HTTP method supported by Guzzle
        ['cookies' => true],                    // Custom config supported by Guzzle Client
        [                                       // Custom options supported by Guzzle Client
            RequestOptions::FORM_PARAMS => [
                'full_name' => 'John Doe'
            ]
        ]
    ]);
} catch (\Exception $e) {
    // Handle or log the failure here; $response stays null
}

Features

A storage object can be shared between subsequent spider calls.

$storage = new SpiderStorage();
$storage->add($sharedData);

$response = $manager->runSpider([
    AppBundle\Spiders\CustomSpider::class,  // Spider class created
    'http://localhost/test-spider',         // URL to spider
    'post',                                 // HTTP method supported by Guzzle
    ['cookies' => true],                    // Custom config supported by Guzzle Client
    [                                       // Custom options supported by Guzzle Client
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ],
    $storage                                // Shared storage
]);

You can also build a queue of spider calls and leave the entire execution to the manager.

$queue = new SpiderCallQueue();

$queue->enqueue(
    AppBundle\Spiders\FirstPageSpider::class,
    'http://localhost/test-spider',
    'post',
    ['cookies' => true],
    [
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ]
);
$queue->enqueue(
    AppBundle\Spiders\SecondPageSpider::class,
    'http://localhost/test-spider',
    'get',
    ['cookies' => true],
    []
);

$response = $manager->runSpiderQueue($queue);

Last but not least, the SpiderManager handles retries on failure using a custom Guzzle middleware.
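The README does not show how the retry middleware is implemented. As a rough illustration of the general technique only, the sketch below wires a retry policy into a Guzzle client with Guzzle's own Middleware::retry. The names MAX_RETRIES, retryDecider, retryDelay and createRetryingClient are hypothetical, and the retry limit and backoff values are assumptions rather than WebSpider's actual settings.

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

// Assumed limit; the value WebSpider really uses is not documented here.
const MAX_RETRIES = 3;

/**
 * Decider for Middleware::retry(): return true to retry the request.
 * Guzzle passes the number of retries so far, the request, and either
 * a response or an exception (the other one is null).
 */
function retryDecider($retries, $request, $response = null, $exception = null)
{
    if ($retries >= MAX_RETRIES) {
        return false; // give up once the limit is reached
    }
    if ($exception instanceof ConnectException) {
        return true; // network-level failure, worth another attempt
    }

    // Otherwise retry only on server errors (5xx).
    return $response !== null && $response->getStatusCode() >= 500;
}

/**
 * Delay before the next attempt, in milliseconds: 1000, 2000, 4000, ...
 */
function retryDelay($retries)
{
    return 1000 * (int) pow(2, $retries - 1);
}

/**
 * Build a Guzzle client whose handler stack retries failed requests.
 */
function createRetryingClient(array $config = [])
{
    $stack = HandlerStack::create();
    $stack->push(Middleware::retry('retryDecider', 'retryDelay'));

    return new Client(['handler' => $stack] + $config);
}
```

Calling createRetryingClient(['cookies' => true]) requires Guzzle to be installed and autoloaded. The decider and delay functions are kept as plain functions so the policy can be checked without sending any request.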

Proxy

Links
