This package provides a Scrapy middleware which allows you to avoid re-crawling pages that were already downloaded in previous crawls.
License is MIT.
    pip install scrapy-crawl-once
To enable it, modify your settings.py:
    SPIDER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
        # ...
    }

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
        # ...
    }
By default it does nothing. To avoid crawling a particular page multiple times, set request.meta['crawl_once'] = True. When a response is received and the callback completes successfully, the fingerprint of the request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is already in the database, and drops the request if it is.
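For example, a spider can opt detail pages in while leaving listing pages untouched (a minimal sketch; the spider name, start URL, and selectors are hypothetical):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['https://example.com/catalog']

        def parse(self, response):
            for href in response.css('a.item::attr(href)').getall():
                # Detail pages are downloaded at most once across crawls;
                # on later runs the middleware drops any request whose
                # fingerprint is already in the database.
                yield response.follow(href, callback=self.parse_item,
                                      meta={'crawl_once': True})

        def parse_item(self, response):
            yield {'url': response.url,
                   'title': response.css('title::text').get()}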
Other request.meta keys (illustrated in the example after this list):

- crawl_once_value - a value to store in the database; by default, a timestamp is stored.
- crawl_once_key - a unique request id; by default, request_fingerprint is used.
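For instance, a request can override both keys to deduplicate on a stable URL and record a readable date (a minimal sketch; the key and value choices are only illustrations):

    import time
    import scrapy

    def item_request(url):
        # Deduplicate on the canonical URL instead of the default
        # request_fingerprint, and store a date instead of a timestamp.
        return scrapy.Request(url, meta={
            'crawl_once': True,
            'crawl_once_key': url,
            'crawl_once_value': time.strftime('%Y-%m-%d'),
        })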
Supported settings (a combined example follows the list):

- CRAWL_ONCE_ENABLED - set it to False to disable the middleware. Default is True.
- CRAWL_ONCE_PATH - a path to a folder with the database of crawled requests. By default, the .scrapy/crawl_once/ path inside a project dir is used; this folder contains <spider_name>.sqlite files with the databases of seen requests.
- CRAWL_ONCE_DEFAULT - the default value for the crawl_once meta key (False by default). When True, all requests are handled by this middleware unless disabled explicitly using request.meta['crawl_once'] = False.
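Putting the settings together, a project that opts every request in by default might configure settings.py like this (a sketch; the middleware entries repeat the enabling step above, and the values shown for CRAWL_ONCE_ENABLED and CRAWL_ONCE_PATH are the defaults):

    # settings.py
    SPIDER_MIDDLEWARES = {
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    }
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
    }

    CRAWL_ONCE_ENABLED = True                # default; set to False to disable
    CRAWL_ONCE_PATH = '.scrapy/crawl_once/'  # folder holding <spider_name>.sqlite files
    CRAWL_ONCE_DEFAULT = True                # handle every request unless it sets
                                             # request.meta['crawl_once'] = False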
https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package which does almost the same thing. Differences:
- scrapy-deltafetch chooses whether to discard a request or not based on yielded items, while scrapy-crawl-once uses an explicit request.meta['crawl_once'] flag;
- scrapy-deltafetch uses bsddb3, while scrapy-crawl-once uses sqlite.
Another alternative is the built-in Scrapy HTTP cache. Differences:

- scrapy cache stores all pages on disk, while scrapy-crawl-once only keeps request fingerprints;
- scrapy cache allows more fine-grained invalidation, consistent with how browsers work;
- with scrapy cache, all pages are still processed (though not all pages are downloaded).
- source code: https://github.com/TeamHG-Memex/scrapy-crawl-once
- bug tracker: https://github.com/TeamHG-Memex/scrapy-crawl-once/issues
To run tests, install tox and run tox from the source checkout.