Cheers

Scrape a website efficiently, block by block, page by page.

Motivations

This is a Cheerio-based scraper that extracts data from a website using CSS selectors.
The motivation behind this package is to provide a simple scraping tool that divides a page into blocks and transforms each block into a JSON object using CSS selectors.

Built on top of these excellent modules:

https://github.com/cheeriojs/cheerio
https://github.com/chriso/curlrequest
https://github.com/kriskowal/q

CSS mapping syntax inspired by:

https://github.com/dharmafly/noodle

Getting Started

Install the module with: npm install cheers

Usage

Configuration options:

config.url : the URL to scrape (single URL, or array of URLs, or sitemap.xml)
config.blockSelector : the CSS selector applied to the page to divide it into scraping blocks. This field is optional and defaults to "body".
config.scrape : the definition of what you want to extract from each block. Each key has two mandatory attributes: selector (a CSS selector, or . to stay on the current node) and extract. The possible values for extract are text, html, outerHTML, a RegExp, or the name of an attribute of the HTML element (e.g. "href").
config.curlOptions : additional options you want to pass to curl. See the documentation at https://github.com/chriso/curlrequest for more information.
config.blacklist : an array of URLs to ignore (for sitemap scraping).
config.verbose : show more logs when scraping (for debugging purposes).

var cheers = require('cheers');

//let's scrape this excellent JS news website
var config = {
    url: "http://www.echojs.com/",
    curlOptions: {
        'useragent': 'Cheers'
    },
    blockSelector: "article",
    scrape: {
        title: {
            selector: "h2 a",
            extract: "text"
        },
        link: {
            selector: "h2 a",
            extract: "href"
        },
        articleInnerHtml: {
            selector: ".",
            extract: "html"
        },
        articleOuterHtml: {
            selector: ".",
            extract: "outerHTML"
        },
        articlePublishedTime: {
            selector: 'p',
            extract: /\d* (?:hour[s]?|day[s]?) ago/
        }
    }
};

cheers.scrape(config).then(function (results) {
    console.log(JSON.stringify(results));
}).catch(function (error) {
    console.error(error);
});
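The options list also mentions sitemap scraping with a blacklist, which the example above does not cover. The following is a minimal sketch of such a configuration, assuming the option names behave as described in the list. The helper buildSitemapConfig, the run wrapper, and the example.com URLs are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper: builds a cheers config for sitemap-based scraping.
// The option names (url, blacklist, verbose, blockSelector, scrape) come from
// the options list; the helper itself and the example.com URLs are made up.
function buildSitemapConfig(sitemapUrl, ignoredUrls) {
    return {
        url: sitemapUrl,          // a sitemap.xml instead of a single page
        blacklist: ignoredUrls,   // URLs from the sitemap that should be skipped
        verbose: true,            // more logs while debugging
        blockSelector: "article",
        scrape: {
            title: { selector: "h2 a", extract: "text" },
            link: { selector: "h2 a", extract: "href" }
        }
    };
}

var sitemapConfig = buildSitemapConfig("http://www.example.com/sitemap.xml", [
    "http://www.example.com/login"
]);

// Needs the cheers package and network access, so it is only defined here.
// Call run(sitemapConfig) to start scraping.
function run(config) {
    var cheers = require('cheers');
    return cheers.scrape(config).then(function (results) {
        console.log(JSON.stringify(results));
    }).catch(function (error) {
        console.error(error);
    });
}
```

Keeping the scrape call inside run means the configuration can be built and inspected without touching the network.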

Shell script

Instead of using cheers from JavaScript, you can use the provided shell script, which wraps the library. To install the shell script globally on your system, run npm install cheers -g or npm install cheers --global

You'll then be able to use the cheers command from a terminal.

The cheers command scrapes content according to a config file that takes the same options as described above, except that it is written as a JSON file.

Example of a config file (same config as above):

config.json:

{
    "url": "http://www.echojs.com/",
    "blockSelector": "article",
    "scrape": {
        "title": {
            "selector": "h2 a",
            "extract": "text"
        },
        "link": {
            "selector": "h2 a",
            "extract": "href"
        },
        "articleInnerHtml": {
            "selector": ".",
            "extract": "html"
        },
        "articleOuterHtml": {
            "selector": ".",
            "extract": "outerHTML"
        },
        "articlePublishedTime": {
            "selector": "p",
            "extract": "/\\d* (?:hour[s]?|day[s]?) ago/"
        }
    }
}

The main difference appears when you use a regular expression: it is written as a string, and every \ must be escaped to keep the file valid JSON.

Usage example:
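The escaping can be produced mechanically rather than by hand. The sketch below is one possible way to turn a JavaScript config into the JSON form shown above; toJsonConfig is a hypothetical helper and not part of cheers. It relies only on standard JavaScript behaviour: a RegExp converted to a string yields its /pattern/ form, and JSON.stringify doubles each backslash.

```javascript
// Hypothetical helper (not part of cheers): serializes a config object to JSON,
// writing each RegExp as its "/pattern/" string form. JSON.stringify takes care
// of escaping the backslashes, so \d becomes \\d in the output.
function toJsonConfig(config) {
    return JSON.stringify(config, function (key, value) {
        // Without this replacer a RegExp would be serialized as {}.
        return value instanceof RegExp ? String(value) : value;
    }, 4);
}

var jsonText = toJsonConfig({
    url: "http://www.echojs.com/",
    blockSelector: "article",
    scrape: {
        articlePublishedTime: {
            selector: "p",
            extract: /\d* (?:hour[s]?|day[s]?) ago/
        }
    }
});

console.log(jsonText);
```

The printed text can be saved as config.json. Regular expression flags, if any, would also end up in the string; whether the shell script accepts them is not covered here.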

cheers -conf /directory/config.json

Unit tests

Run the tests with the command npm test

If you don't need the test dependencies, install with npm install --production

Roadmap

~~Option to change the user agent~~
~~Command line tool~~
~~Unit tests~~
~~Array of URLs~~
~~Start from sitemap~~
Website pagination
Option to use request instead of curl
Option to use a headless browser

Contributors

Cheers!
