loodeer/xiaohongshuSpider

xiaohongshuSpider

For testing and learning purposes only. Please do not use it for illegal purposes. The author is not responsible for any consequences.

For Scrapy installation and a getting-started guide, see the Scrapy 0.24 documentation.

Debugging XPath expressions for response elements in the shell

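Run scrapy shell http://m.xiaohongshu.com to try out the XPath expression for each required element interactively on the command line, which makes the later crawl easier to write. See also the Scrapy documentation section on invoking the shell from a spider to inspect the response.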

Banner images at the top of the home page:

response.xpath('//div[contains(@class, "banner")]//img/@src').extract()

Article cover images:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//div[contains(@class, "note-cover")]//img/@src').extract()

Note: the prefix of each extracted URL still needs to be processed.

Link targets:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]/a/@href').extract()

Note: a prefix must be prepended to each extracted link.
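The prefix handling noted for the cover images and links could be sketched as follows. This is a minimal example, assuming the extracted values are relative or protocol-relative URLs that should be resolved against the crawled page. The helper name `absolutize` and the sample paths are hypothetical and not taken from the repository.

```python
from urllib.parse import urljoin

# Assumption: URLs are resolved against the page that was crawled.
BASE_URL = "http://m.xiaohongshu.com"


def absolutize(urls, base=BASE_URL):
    """Turn relative or protocol-relative URLs into absolute ones.

    URLs that are already absolute are returned unchanged.
    """
    return [urljoin(base, url) for url in urls]


if __name__ == "__main__":
    # Hypothetical sample values in the shape .extract() returns (a list of strings).
    samples = ["/discovery/item/abc", "//img.example.com/cover.jpg"]
    for url in absolutize(samples):
        print(url)
```

Inside a spider callback, recent Scrapy versions offer response.urljoin(url), which resolves against the response URL in the same way.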

Title:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//h3[contains(@class, "note-title")]/text()').extract()

Description:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//p[contains(@class, "note-desc")]/text()').extract()

Author:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//div[contains(@class, "note-author")]//a[contains(@class, "note-author-nickname")]/text()').extract()

Avatar:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//div[contains(@class, "note-author")]//img[contains(@class, "avatar-img")]/@src').extract()

Like count:

response.xpath('//div[contains(@class, "dual-column-layout")]//div[contains(@class, "note-item")]//span[contains(@class, "note-likes")]//text()').extract()

Note: the extracted text has line-break characters before and after the value.
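The surrounding line breaks in the like counts can be removed with plain string handling. A minimal sketch, assuming the input is the list of strings returned by .extract(); the helper name `clean_text_nodes` and the sample values are hypothetical.

```python
def clean_text_nodes(values):
    """Strip surrounding whitespace and drop nodes that are whitespace only."""
    stripped = (value.strip() for value in values)
    return [value for value in stripped if value]


if __name__ == "__main__":
    # Hypothetical raw text nodes, padded with line breaks as described above.
    raw = ["\n      12\n    ", "\n", "\n      3\n    "]
    print(clean_text_nodes(raw))
```

Dropping the whitespace-only nodes changes the length of the list, so when the values must stay aligned with other per-note fields it is safer to clean the text node by node inside a loop over the note elements.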

Once the spider code is written, run scrapy crawl xiaohongshu -o items_index.json to save the results to a JSON file.
