
A lightweight crawling/spider framework for everyone (supports JavaScript!).


HuberTRoy/Seen


Seen

Seen is a lightweight web crawling framework for everyone, written with asyncio and aiohttp/requests.

It is useful for writing a web crawler quickly, with full JavaScript support.

Working Process: [figure: workingProcess]

Requirements:

  • Python 3.5+
  • aiohttp or requests
  • pyquery

Installation:

pip install seen

Get JavaScript support!

pip install pyppeteer

Usage:

  1. Write spider.py
from seen import Spider, Parser, Item, Css


class Post(Item):
    title = Css('title')
    img = Css('img', 'src')

    def save(self):
        # Print the fields extracted for this item, stored in self.result.
        print(self.result['title'])
        print(self.result['img'])


class MySpider(Spider):
    roots = 'https://www.v2ex.com'
    url_limit = ('www.v2ex.com')
    concurrency = 1
    # Set use_browser = True to load JavaScript-rendered pages.
    # The default is False.
    use_browser = False

    parsers = [Parser(Post)]


if __name__ == '__main__':
    spider = MySpider()

    spider.start()
  2. Run python spider.py.
  3. Check the result.
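
A note on the `url_limit = ('www.v2ex.com')` line in the spider above: in Python, parentheses alone do not make a tuple, so that value is a plain string. Whether Seen expects a string, a tuple, or accepts both for `url_limit` is not stated in this README, so treat the following as a minimal sketch of the language behavior only, not as a description of Seen's API. The variable names `as_written` and `one_tuple` are illustrative.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
as_written = ('www.v2ex.com')    # a plain str, same as 'www.v2ex.com'
one_tuple = ('www.v2ex.com',)    # a tuple containing one host name

print(type(as_written).__name__)
print(type(one_tuple).__name__)

# Iterating the two behaves very differently:
# the str yields single characters, the tuple yields whole host names.
first_chars = list(as_written)[:3]
hosts = list(one_tuple)
print(first_chars)
print(hosts)
```

If you later want to restrict crawling to several hosts, check how Seen reads `url_limit` before switching between the two forms.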

Contribution

  • Pull request.
  • Open an issue.
