1. Scrapy is a framework built on Twisted: async IO gives good performance, it is easy to extend, and it ships with many built-in features (CSS and XPath selectors). requests and BeautifulSoup are just libraries.
2. create virtualenv: pipenv install scrapy
3. set the PyCharm interpreter: pipenv shell -> which python3 -> PyCharm: add local interpreter
4. pipenv shell -> scrapy startproject xxx -> cd xxx && scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
BoleSpider.spiders.jobbole
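The basic template produces roughly this skeleton:

import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass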
5. 1) scrapy.cmdline.execute: run/debug the spider from a script
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "jobbole"])
2) scrapy shell http://blog.jobbole.com/110287/ for interactive debugging in the terminal
2019-01-01 16:24:58 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-01 16:24:58 [scrapy.core.engine] INFO: Spider opened
2019-01-01 16:24:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/110287/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10e1fca90>
[s] item {}
[s] request <GET http://blog.jobbole.com/110287/>
[s] response <200 http://blog.jobbole.com/110287/>
[s] settings <scrapy.settings.Settings object at 0x10f1ec710>
[s] spider <DefaultSpider 'default' at 0x10f4b2c18>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> result = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()');
>>> result.extract()[0];
result can be treated like a monad: you can chain further xpath() queries on it; extract() unwraps the values, but they are no longer Selectors.
6. extract_first() returns a default (None) when nothing matched, instead of raising IndexError like extract()[0]. For example:
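# in scrapy shell, on a selector that matches nothing:
response.css(".missing::text").extract_first()            # -> None, no error
response.css(".missing::text").extract_first(default="")  # -> ""
response.css(".missing::text").extract()[0]               # -> IndexError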
7. filtering with a list comprehension: [el for el in tags if not el ...]
8. "//span[contains(@class, 'bookmark-btn')]/text()" span 的 class 不唯一时, contains
9. yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_post): yield suspends the generator, so only after the loop over the current page's 20 post_urls has finished does the code below reach next_page and crawl the next 20. A minimal sketch of that loop (the list-page CSS selectors and parse_post are assumptions):
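from urllib import parse
import scrapy

# inside the JobboleSpider class:
def parse(self, response):
    # one Request per post on the current list page
    for post_url in response.css("#archive .post-thumb a::attr(href)").extract():
        yield scrapy.Request(url=parse.urljoin(response.url, post_url),
                             callback=self.parse_post)
    # only reached after the loop above has yielded every post Request
    next_page = response.css(".next.page-numbers::attr(href)").extract_first()
    if next_page:
        yield scrapy.Request(url=parse.urljoin(response.url, next_page),
                             callback=self.parse)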
10. items
unlike a plain model, an Item can be yielded with its fields straight to the pipelines:
yield post_item
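A sketch of the Item declaration implied by the parse code at the end of this log:

import scrapy

class BolePostItem(scrapy.Item):
    # one scrapy.Field() per attribute assigned in parse_post
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    preview_img = scrapy.Field()  # kept as a list for ImagesPipeline
    votes = scrapy.Field()
    comments = scrapy.Field()
    bookmarks = scrapy.Field()
    tags = scrapy.Field()
    body = scrapy.Field()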
11. settings.py: configure scrapy image downloading (ImagesPipeline needs Pillow installed)
import os

ITEM_PIPELINES = {
    'BoleSpider.pipelines.BolespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,  # lower numbers sit earlier in the pipeline
}
IMAGES_URLS_FIELD = "preview_img"  # item field holding the image URLs (must be a list)
project_dir = os.path.abspath(os.path.dirname(__file__))  # path relative to settings.py
IMAGES_STORE = os.path.join(project_dir, 'images')  # folder the images are saved into
12. hashlib.md5: in Python 3 every str is Unicode and must be encode("utf-8")-ed before it can be MD5-hashed, otherwise:
TypeError: Unicode-objects must be encoded before hashing
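A minimal sketch of the to_md5 helper used in the parse code below:

import hashlib

def to_md5(url):
    if isinstance(url, str):       # str is Unicode in Python 3
        url = url.encode("utf-8")  # md5 only accepts bytes
    return hashlib.md5(url).hexdigest()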
13. Twisted supports pooled, asynchronous writes of MySQL data:
from twisted.enterprise import adbapi
dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
self.dbpool.runInteraction(self.to_insert, item)
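A sketch of the surrounding pipeline, assuming dbparms is read from settings and do_insert writes one row (table and column names are placeholders):

from twisted.enterprise import adbapi
import MySQLdb.cursors

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
        )
        return cls(adbapi.ConnectionPool("MySQLdb", **dbparms))

    def process_item(self, item, spider):
        # the blocking insert runs on a thread from Twisted's pool
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(lambda failure: spider.logger.error(failure))
        return item

    def do_insert(self, cursor, item):
        cursor.execute(
            "insert into bole_post(title, url, url_object_id) values (%s, %s, %s)",
            (item["title"], item["url"], item["url_object_id"]),
        )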
14. in an ItemLoader, each value pulled in by add_css runs through the input_processor/output_processor chains built with MapCompose (MapCompose applies its functions to the value in order) until it is database-ready, then gets yielded to the pipelines (image, json, mysql, etc.). Similar to a JS Proxy's get/set hooks; see the sketch after the next note.
15. TakeFirst as an output processor behaves like extract()[0]: it keeps the first non-empty value.
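A minimal sketch of notes 14 and 15 together (the date helper and selectors are assumptions):

import datetime
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

def to_date(value):
    try:
        return datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except ValueError:
        return datetime.datetime.now().date()

# MapCompose runs strip first, then to_date, on every extracted value
date_in = MapCompose(str.strip, to_date)

class BolePostItemLoader(ItemLoader):
    default_output_processor = TakeFirst()  # keep the first non-empty value, like extract()[0]

# on the Item:   create_date = scrapy.Field(input_processor=date_in)
# in the spider: loader = BolePostItemLoader(item=BolePostItem(), response=response)
#                loader.add_css("create_date", ".entry-meta-hide-on-mobile::text")
#                yield loader.load_item()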
16. scrapy genspider --list shows the spider templates: basic/crawl/csvfeed/xmlfeed
17. use the stats collector through the crawler's stats attribute; dispatcher.connect(stats_handler, signals.spider_closed) can print the collected stats when the spider finishes.
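A minimal sketch, assuming a handler named stats_handler:

import scrapy
from pydispatch import dispatcher
from scrapy import signals

class JobboleSpider(scrapy.Spider):
    name = "jobbole"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # fire stats_handler when the spider finishes
        dispatcher.connect(self.stats_handler, signals.spider_closed)

    def stats_handler(self, spider, reason):
        spider.logger.info(self.crawler.stats.get_stats())  # dump the collected stats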
18. Elasticsearch suggesters:
term suggester / phrase suggester / completion suggester / context suggester
PUT music
{
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "completion"
},
"title" : {
"type": "keyword"
}
}
}
}
}
GET _analyze
{
"analyzer": "ik_smart",
"text": "python网络开发工程师"
}
PUT music/_doc/1?refresh
{
"suggest" : {
"input": [ "Nevermind", "Nirvana" ],
"weight" : 34
}
}
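And the matching query against the completion field defined above (standard suggest search syntax):
POST music/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "nir",
      "completion": {
        "field": "suggest"
      }
    }
  }
}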
-----------------------------------------------------------------------------------------------------------
## response.css / response.xpath
post_item = BolePostItem()
# CSS selectors (fragment of parse_post; `import re` and `import datetime` assumed at module level)
title = response.css(".entry-header h1::text").extract_first().strip()
create_date = response.css(".entry-meta-hide-on-mobile::text").extract_first().strip().replace("·", "").strip()
votes = int(response.css(".vote-post-up h10::text").extract_first())
bookmarks = response.css(".bookmark-btn::text").extract_first()
match_re = re.match(r".*?(\d+).*", bookmarks)
if match_re:
    bookmarks = int(match_re.group(1))
else:
    bookmarks = 0
comments = response.css("a[href='#article-comment'] span::text").extract_first()
match_re = re.match(r".*?(\d+).*", comments)
if match_re:
    comments = int(match_re.group(1))
else:
    comments = 0
body = response.css("div.entry").extract_first()
tags = response.css("p.entry-meta-hide-on-mobile a::text").extract()
tags = [el for el in tags if not el.strip().endswith("评论")]  # drop the "N 评论" (comment-count) pseudo-tag
tags = ",".join(tags)
preview_img = response.meta.get("preview_img", "")
try:
    create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
except Exception:
    create_date = datetime.datetime.now().date()  # fall back to today if the date string doesn't parse
post_item["create_date"] = create_date
post_item["title"] = title
post_item["url"] = response.url
post_item["preview_img"] = [preview_img]
post_item["votes"] = votes
post_item["comments"] = comments
post_item["bookmarks"] = bookmarks
post_item["body"] = body
post_item["tags"] = tags
post_item["url_object_id"] = to_md5(response.url)
# XPath equivalents of the CSS extraction above
title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()').extract_first()
create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first().strip().replace("·", "").strip()
votes = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract_first()
favs = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract_first()
match_re = re.match(r".*?(\d+).*", favs)
if match_re:
    favs = match_re.group(1)
comments = response.xpath("//a[@href='#article-comment']/span/text()").extract_first()
match_re = re.match(r".*?(\d+).*", comments)
if match_re:
    comments = match_re.group(1)
body = response.xpath("//div[@class='entry']").extract_first()
tags = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tags = [el for el in tags if not el.strip().endswith("评论")]
tags = ",".join(tags)