scrapy best practice
pip install -r requirements.txt
|____bin #bash scripts
|____requirements.txt
|____scrappy
| |____dbs #storage DAO
| |____extensions #scrapy extensions
| |____items
| |____middlewares
| |____resources #static resources
| |____scripts #py scripts
| |____services #py services
| |____spiders #spider definitions
| |____utils #python utils
|____scrapy.cfg
Write a spider in spiders
- extend CrawlSpider
- define name
- define start_urls or a start_requests function
- define a parse function to parse the response
- define item models in items
- define a pipeline in pipelines
  - handleInsert: parse the item before insert
  - handleUpdate: parse the item before update
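The handleInsert / handleUpdate pipeline hooks can be sketched in plain Python. This is a minimal sketch: the class name, the item's `id`/`title` keys, and the in-memory `store` dict are hypothetical stand-ins for the real DAO layer under scrappy/dbs.

```python
class ExamplePipeline:
    """Route each scraped item to an insert or an update handler.

    Sketch only: a real pipeline would query the DAO layer in
    scrappy/dbs instead of an in-memory dict.
    """

    def __init__(self):
        # stand-in for a real database table, keyed by item id
        self.store = {}

    def process_item(self, item, spider):
        key = item["id"]
        if key in self.store:
            self.store[key] = self.handle_update(item)
        else:
            self.store[key] = self.handle_insert(item)
        return item

    def handle_insert(self, item):
        # parse/normalize the item before the first insert
        item["title"] = item["title"].strip()
        return item

    def handle_update(self, item):
        # parse/clean the item before updating the stored row
        item["title"] = item["title"].strip()
        return item
```

In a real project the pipeline is enabled through ITEM_PIPELINES in settings, and Scrapy calls process_item once per scraped item.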
extends BaseSpider
- CrawlSpider: a normal spider; it runs distributed if ENABLE_REDIS is set to True in settings
- scrappy.extensions.scrapy_redis.spiders.RedisSpider: the spider never shuts down; it keeps popping requests from redis
- ResourceHelper: reads, writes and creates files
- RemoveCookieMiddleware: removes cookies before each request
- RandomProxyMiddleware: switches to a random proxy before each request
- UserAgentMiddleware: switches to a random User-Agent before each request; it automatically switches the configuration file (Linux is treated as the production platform)
- ENABLE_REDIS: enables redis distribution and redis stats
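A random User-Agent downloader middleware like the one listed above can be sketched without importing Scrapy, since Scrapy only requires a class with a process_request method. The class name and the agent list are hypothetical; a real agent list would live under scrappy/resources.

```python
import random

# Hypothetical sample list; the real one would be loaded from
# scrappy/resources via ResourceHelper.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]


class RandomUserAgentMiddleware:
    """Sketch of a downloader middleware that rotates User-Agents."""

    def process_request(self, request, spider):
        # overwrite the User-Agent header before the request is sent
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # returning None tells Scrapy to continue processing the request
        return None
```

To activate it, the class would be registered in DOWNLOADER_MIDDLEWARES in settings; the proxy and cookie middlewares follow the same process_request hook.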
have a nice day :)