A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.
Features
- [x] Write spiders as YAML configs (see the sketch after this list).
- [x] Create extractors to scrape data (HTML, API, RSS) using YAML configs.
- [x] Define multiple extractors per spider.
- [x] Use standard extractors to scrape common page data such as tables, paragraphs, meta tags, and JSON-LD.
- Traverse between multiple websites.
- Write Python extractors for advanced extraction strategies.
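As a rough illustration of the config-driven approach, a spider definition might look something like the sketch below. Every key name here is an assumption made for illustration, not the actual crawlerflow schema; refer to the example configs referenced later for the real format.

```yaml
# Illustrative sketch only: key names are assumptions, not crawlerflow's schema.
# See the example-configs/ folder in the repository for real config files.
spider_name: github-blog            # hypothetical identifier for the spider
start_urls:
  - https://github.blog/            # where the crawl begins
extractors:
  - extractor: MetaTagExtractor     # standard extractors listed further below
  - extractor: TableContentExtractor
```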
Install from GitHub:

```bash
pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow
```
Run a single spider assembled from YAML configs:

```python
from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json

# Load the crawl requests, spider and extractor definitions from YAML configs
crawl_requests = yaml_to_json(open("example-configs/crawlerflow/requests/github-detail-urls.yml"))
spider_config = yaml_to_json(open("example-configs/crawlerflow/spiders/default-spider.yml"))
github_default_extractor = yaml_to_json(open("example-configs/crawlerflow/extractors/github-blog-detail.yml"))

# Register the spider and start the crawl
flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()
```
Run several spiders by adding one config per spider:

```python
from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

# Register one spider per config file, then start the crawl
crawlerflow = WebCrawler()
for scraper_config_file in scraper_config_files:
    scraper_config = yaml_to_json(open(scraper_config_file))
    crawlerflow.add_spider_with_config(scraper_config)
crawlerflow.start()
```
Refer to the `example-configs/` folder for more example configurations.
The following standard extractors are available out of the box:

- [x] HTMLExtractor
- [x] MetaTagExtractor
- [x] JSONLDExtractor
- [x] TableContentExtractor
- [x] IconsExtractor
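For data the standard extractors do not cover, an extractor can be described in its own YAML file, such as the `github-blog-detail.yml` config loaded in the first example above. The keys in the sketch below are assumptions for illustration only, not the actual crawlerflow schema:

```yaml
# Hypothetical custom extractor definition; key names are assumptions,
# not crawlerflow's real schema. See example-configs/ for actual files.
extractor_name: github-blog-detail
data_selectors:
  - id: title
    selector: h1.post-title        # CSS selector for the blog post title
    data_type: string
  - id: tags
    selector: a.post-tag           # a selector matching multiple elements
    data_type: list
```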