1. Scrapy is a framework built on Twisted: async IO gives good performance, it is easy to extend, and it ships with many built-in features (CSS and XPath selectors). requests and BeautifulSoup are just libraries.
2. create virtualenv: pipenv install scrapy
3. set the PyCharm interpreter: pipenv shell -> which python3 -> PyCharm: add local interpreter
4. pipenv shell -> scrapy startproject xxx -> cd xxx && scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
BoleSpider.spiders.jobbole
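The basic template produces roughly this skeleton:

import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass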
5. 1) scrapy.cmdline.execute: run/debug the spider from a script
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "jobbole"])
2) scrapy shell http://blog.jobbole.com/110287/ for interactive debugging in the terminal
2019-01-01 16:24:58 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-01 16:24:58 [scrapy.core.engine] INFO: Spider opened
2019-01-01 16:24:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/110287/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10e1fca90>
[s] item {}
[s] request <GET http://blog.jobbole.com/110287/>
[s] response <200 http://blog.jobbole.com/110287/>
[s] settings <scrapy.settings.Settings object at 0x10f1ec710>
[s] spider <DefaultSpider 'default' at 0x10f4b2c18>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> result = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()');
>>> result.extract()[0];
result can be treated like a monad: you can chain further xpath() queries on it; extract() unwraps the values, but they are no longer Selectors.
6. extract_first() returns a default (None) when nothing matched, instead of raising IndexError like extract()[0]. For example:
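# in scrapy shell, on a selector that matches nothing:
response.css(".missing::text").extract_first()            # -> None, no error
response.css(".missing::text").extract_first(default="")  # -> ""
response.css(".missing::text").extract()[0]               # -> IndexError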
7. filtering with a list comprehension: [el for el in tags if not el ...]
8. "//span[contains(@class, 'bookmark-btn')]/text()" span 的 class 不唯一时, contains
9. yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_post): yield suspends the generator, so only after the loop over the current page's 20 post_urls has finished does the code below reach next_page and crawl the next 20. A minimal sketch of that loop (the list-page CSS selectors and parse_post are assumptions):
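from urllib import parse
import scrapy

# inside the JobboleSpider class:
def parse(self, response):
    # one Request per post on the current list page
    for post_url in response.css("#archive .post-thumb a::attr(href)").extract():
        yield scrapy.Request(url=parse.urljoin(response.url, post_url),
                             callback=self.parse_post)
    # only reached after the loop above has yielded every post Request
    next_page = response.css(".next.page-numbers::attr(href)").extract_first()
    if next_page:
        yield scrapy.Request(url=parse.urljoin(response.url, next_page),
                             callback=self.parse)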
10. items
unlike a plain model, an Item can be yielded with its fields straight to the pipelines:
yield post_item
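A sketch of the Item declaration implied by the parse code at the end of this log:

import scrapy

class BolePostItem(scrapy.Item):
    # one scrapy.Field() per attribute assigned in parse_post
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    preview_img = scrapy.Field()  # kept as a list for ImagesPipeline
    votes = scrapy.Field()
    comments = scrapy.Field()
    bookmarks = scrapy.Field()
    tags = scrapy.Field()
    body = scrapy.Field()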
11. settings.py: configure scrapy image downloading (ImagesPipeline needs Pillow installed)
import os

ITEM_PIPELINES = {
    'BoleSpider.pipelines.BolespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,  # lower numbers sit earlier in the pipeline
}
IMAGES_URLS_FIELD = "preview_img"  # item field holding the image URLs (must be a list)
project_dir = os.path.abspath(os.path.dirname(__file__))  # path relative to settings.py
IMAGES_STORE = os.path.join(project_dir, 'images')  # folder the images are saved into
12. hashlib.md5: in Python 3 every str is Unicode and must be encode("utf-8")-ed before it can be MD5-hashed, otherwise:
TypeError: Unicode-objects must be encoded before hashing
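A minimal sketch of the to_md5 helper used in the parse code below:

import hashlib

def to_md5(url):
    if isinstance(url, str):       # str is Unicode in Python 3
        url = url.encode("utf-8")  # md5 only accepts bytes
    return hashlib.md5(url).hexdigest()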
13. Twisted supports pooled, asynchronous writes of MySQL data:
from twisted.enterprise import adbapi
dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
self.dbpool.runInteraction(self.to_insert, item)
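A sketch of the surrounding pipeline, assuming dbparms is read from settings and do_insert writes one row (table and column names are placeholders):

from twisted.enterprise import adbapi
import MySQLdb.cursors

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
        )
        return cls(adbapi.ConnectionPool("MySQLdb", **dbparms))

    def process_item(self, item, spider):
        # the blocking insert runs on a thread from Twisted's pool
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(lambda failure: spider.logger.error(failure))
        return item

    def do_insert(self, cursor, item):
        cursor.execute(
            "insert into bole_post(title, url, url_object_id) values (%s, %s, %s)",
            (item["title"], item["url"], item["url_object_id"]),
        )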
14. in an ItemLoader, each value pulled in by add_css runs through the input_processor/output_processor chains built with MapCompose (MapCompose applies its functions to the value in order) until it is database-ready, then gets yielded to the pipelines (image, json, mysql, etc.). Similar to a JS Proxy's get/set hooks; see the sketch after the next note.
15. TakeFirst as an output processor behaves like extract()[0]: it keeps the first non-empty value.
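A minimal sketch of notes 14 and 15 together (the date helper and selectors are assumptions):

import datetime
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

def to_date(value):
    try:
        return datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except ValueError:
        return datetime.datetime.now().date()

# MapCompose runs strip first, then to_date, on every extracted value
date_in = MapCompose(str.strip, to_date)

class BolePostItemLoader(ItemLoader):
    default_output_processor = TakeFirst()  # keep the first non-empty value, like extract()[0]

# on the Item:   create_date = scrapy.Field(input_processor=date_in)
# in the spider: loader = BolePostItemLoader(item=BolePostItem(), response=response)
#                loader.add_css("create_date", ".entry-meta-hide-on-mobile::text")
#                yield loader.load_item()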
16. scrapy genspider --list shows the spider templates: basic/crawl/csvfeed/xmlfeed
17. use the stats collector through the crawler's stats attribute; dispatcher.connect(stats_handler, signals.spider_closed) can print the collected stats when the spider finishes.
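A minimal sketch, assuming a handler named stats_handler:

import scrapy
from pydispatch import dispatcher
from scrapy import signals

class JobboleSpider(scrapy.Spider):
    name = "jobbole"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # fire stats_handler when the spider finishes
        dispatcher.connect(self.stats_handler, signals.spider_closed)

    def stats_handler(self, spider, reason):
        spider.logger.info(self.crawler.stats.get_stats())  # dump the collected stats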
18. Elasticsearch suggesters:
term suggester / phrase suggester / completion suggester / context suggester
PUT music
{
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "completion"
},
"title" : {
"type": "keyword"
}
}
}
}
}
GET _analyze
{
"analyzer": "ik_smart",
"text": "python网络开发工程师"
}
PUT music/_doc/1?refresh
{
"suggest" : {
"input": [ "Nevermind", "Nirvana" ],
"weight" : 34
}
}
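And the matching query against the completion field defined above (standard suggest search syntax):
POST music/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "nir",
      "completion": {
        "field": "suggest"
      }
    }
  }
}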
-----------------------------------------------------------------------------------------------------------
## response.css / response.xpath
post_item = BolePostItem()
# CSS selectors (fragment of parse_post; `import re` and `import datetime` assumed at module level)
title = response.css(".entry-header h1::text").extract_first().strip()
create_date = response.css(".entry-meta-hide-on-mobile::text").extract_first().strip().replace("·", "").strip()
votes = int(response.css(".vote-post-up h10::text").extract_first())
bookmarks = response.css(".bookmark-btn::text").extract_first()
match_re = re.match(r".*?(\d+).*", bookmarks)
if match_re:
    bookmarks = int(match_re.group(1))
else:
    bookmarks = 0
comments = response.css("a[href='#article-comment'] span::text").extract_first()
match_re = re.match(r".*?(\d+).*", comments)
if match_re:
    comments = int(match_re.group(1))
else:
    comments = 0
body = response.css("div.entry").extract_first()
tags = response.css("p.entry-meta-hide-on-mobile a::text").extract()
tags = [el for el in tags if not el.strip().endswith("评论")]  # drop the "N 评论" (comment-count) pseudo-tag
tags = ",".join(tags)
preview_img = response.meta.get("preview_img", "")
try:
    create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
except Exception:
    create_date = datetime.datetime.now().date()  # fall back to today if the date string doesn't parse
post_item["create_date"] = create_date
post_item["title"] = title
post_item["url"] = response.url
post_item["preview_img"] = [preview_img]
post_item["votes"] = votes
post_item["comments"] = comments
post_item["bookmarks"] = bookmarks
post_item["body"] = body
post_item["tags"] = tags
post_item["url_object_id"] = to_md5(response.url)
# XPath equivalents of the CSS extraction above
title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()').extract_first()
create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first().strip().replace("·", "").strip()
votes = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract_first()
favs = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract_first()
match_re = re.match(r".*?(\d+).*", favs)
if match_re:
    favs = match_re.group(1)
comments = response.xpath("//a[@href='#article-comment']/span/text()").extract_first()
match_re = re.match(r".*?(\d+).*", comments)
if match_re:
    comments = match_re.group(1)
body = response.xpath("//div[@class='entry']").extract_first()
tags = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tags = [el for el in tags if not el.strip().endswith("评论")]
tags = ",".join(tags)