- The overall framework is split into four parts (one master controller and three spiders) that can be allocated flexibly
- All spiders are deployed in a distributed fashion, removing bandwidth and performance bottlenecks
- proxy_pool is used to get around IP bans (see the sketch after this list)
- Cookies are disabled so the site cannot track the crawler across requests
- MySQL serves as the underlying data store
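
The cookie and proxy points above map onto ordinary Scrapy configuration plus a small downloader middleware. Below is a minimal sketch, assuming a proxy_pool that returns one `ip:port` per HTTP GET; the endpoint URL, response format, and dotted middleware path are placeholders, not this project's actual interface.

```python
# Illustrative proxy middleware in the spirit of the project's ProxyMiddleware.
# The proxy_pool endpoint and its plain-text "ip:port" response are assumptions.
import requests


class RandomProxyMiddleware(object):
    PROXY_API = 'http://127.0.0.1:5010/get'  # hypothetical proxy_pool endpoint

    def process_request(self, request, spider):
        proxy = requests.get(self.PROXY_API, timeout=3).text.strip()
        if proxy:
            # Route the request through the borrowed IP to avoid per-IP bans
            request.meta['proxy'] = 'http://' + proxy


# In settings.py the same ideas appear as plain options:
# COOKIES_ENABLED = False                      # no cookies, so sessions are not remembered
# DOWNLOADER_MIDDLEWARES = {
#     'JDCommentSpider.middlewares.ProxyMiddleware': 543,   # dotted path is illustrative
# }
```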
- Python 3.6
- Scrapy 1.4
- pymysql
- json
- redis
git clone https://github.com/Dengqlbq/JDSpider.git
Override the following configuration (typical values are sketched after this list)
- ProjectStart/Test.py (redis configuration, keywords, page_count)
- JDUrlsSpider/settings.py (redis configuration)
- JDDetailSpider/settings.py (redis configuration, mysql configuration, DOWNLOAD_DELAY)
- JDCommentSpider/settings.py (redis configuration, mysql configuration, DOWNLOAD_DELAY)
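
A minimal sketch of what those overrides typically look like. The Redis keys follow common scrapy-redis conventions and the MySQL key names are placeholders; copy the exact names already present in each spider's settings.py.

```python
# Hypothetical values only -- adjust hosts, ports and credentials to your setup,
# and keep the key names that the project's settings.py actually uses.

# Redis connection shared by the distributed spiders
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

# MySQL storage for product details and comments (key names are placeholders)
MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'
MYSQL_DB = 'jd'

# Throttle requests so a single node does not hammer the site
DOWNLOAD_DELAY = 1
```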
cd ProjectStart
python Test.py
cd JDUrlsSpider
scrapy crawl JDUrlsSpider
cd JDDetailSpider
scrapy crawl JDDetailSpider
(This is a distributed crawler, so you can run more than one JDDetailSpider instance)
cd JDCommentSpider
scrapy crawl JDCommentSpider
(This is a distributed crawler, so you can run more than one JDCommentSpider instance)
Note:
- Before you run the project, make sure you have created the tables that match the requirements.
- If you have not built a proxy_pool, disable the "ProxyMiddleware" in JDCommentSpider/settings.py (see the sketch below)
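
In Scrapy, a downloader middleware is disabled by mapping its dotted path to `None` (or removing the entry) in `DOWNLOADER_MIDDLEWARES`. The path below is illustrative; use the one already listed in the project's settings.py.

```python
# In JDCommentSpider/settings.py -- dotted path is illustrative, keep the one
# that already appears in DOWNLOADER_MIDDLEWARES.
DOWNLOADER_MIDDLEWARES = {
    'JDCommentSpider.middlewares.ProxyMiddleware': None,  # None disables the middleware
}
```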
Product detail and comment summary
Some comments
Full comment