This scraper function automatically pulls metadata from a page and supports simple HTML querying using cheerio.
It's built on top of stdlib, which runs it as a distributed, scalable service.
You can either use the ready-made service that's deployed on stdlib here, or fork this repository and launch your own version on stdlib.
For example, a simple scrape that picks up my own email address from GitHub (along with a bunch of extra metadata):
    lib nemo.scrape --url https://github.com/nemo --query "li[itemprop='email'] a"

    { metadata:
       { general:
          { description: 'nemo has 36 repositories available. Follow their code on GitHub.',
            title: 'nemo (Nima Gardideh) · GitHub',
            lang: 'en' },
         openGraph:
          { app_id: '1401488693436528',
            image: [Object],
            site_name: 'GitHub',
            type: 'profile',
            title: 'nemo (Nima Gardideh)',
            url: 'https://github.com/nemo',
            description: 'nemo has 36 repositories available. Follow their code on GitHub.',
            username: 'nemo' },
         schemaOrg: { items: [Object] },
         twitter:
          { image: [Object],
            site: '@github',
            card: 'summary',
            title: 'nemo (Nima Gardideh)',
            description: 'nemo has 36 repositories available. Follow their code on GitHub.' } },
      url: 'https://github.com/nemo',
      query: 'li[itemprop=\'email\'] a',
      query_value: 'nima@halfmoon.ws'
    }
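
To show how a result like the one above might be consumed, here is a minimal JavaScript sketch. It assumes the response arrives as a plain object with the shape printed above. The `pickTitle` helper and the trimmed `result` literal are hypothetical and are not part of this repository.

```javascript
// Hypothetical helper, not part of this repository: choose the best
// available title from a scrape result shaped like the output above.
function pickTitle(result) {
  const metadata = (result && result.metadata) || {};
  // Prefer Open Graph, then the Twitter card, then the general page title.
  const sources = [metadata.openGraph, metadata.twitter, metadata.general];
  for (const source of sources) {
    if (source && typeof source.title === 'string' && source.title.length > 0) {
      return source.title;
    }
  }
  return null; // no title found in any metadata group
}

// Sample data trimmed from the output above (assumed shape).
const result = {
  metadata: {
    general: { title: 'nemo (Nima Gardideh) · GitHub', lang: 'en' },
    openGraph: { title: 'nemo (Nima Gardideh)', site_name: 'GitHub' },
  },
  url: 'https://github.com/nemo',
  query: "li[itemprop='email'] a",
  query_value: 'nima@halfmoon.ws',
};

console.log(pickTitle(result));  // title taken from the openGraph group
console.log(result.query_value); // text matched by the --query selector
```

The fallback order is a design choice for this sketch only. Not every page exposes every metadata group, so checking each group before reading from it avoids errors on sparse pages.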
You can view the function specification here.
Note that this scraper does not support sites that are single-page JavaScript applications. You should also follow robots.txt rules when you're scraping websites. Use responsibly.
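
As a rough illustration of checking robots.txt before scraping, the sketch below tests a path against the rules for all user agents. It is deliberately simplified: it only reads `User-agent: *` groups and plain `Disallow` prefixes, and ignores `Allow` rules, wildcards, and agent-specific groups. The `isDisallowed` helper and the sample file are hypothetical and are not part of this scraper.

```javascript
// Hypothetical helper: a deliberately simplified robots.txt check.
// Only "User-agent: *" groups and plain "Disallow" prefixes are handled;
// "Allow" rules, wildcards, and agent-specific groups are ignored.
function isDisallowed(robotsTxt, path) {
  let inWildcardGroup = false;
  let lastWasUserAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const separator = line.indexOf(':');
    if (separator === -1) continue; // blank or malformed line
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      inWildcardGroup = (lastWasUserAgent && inWildcardGroup) || value === '*';
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;
    // An empty Disallow value means nothing is disallowed.
    if (field === 'disallow' && inWildcardGroup && value !== '' && path.startsWith(value)) {
      return true;
    }
  }
  return false;
}

// Made-up robots.txt content for illustration only.
const sampleRobots = [
  'User-agent: *',
  'Disallow: /private/',
  '',
  'User-agent: examplebot',
  'Disallow: /',
].join('\n');

console.log(isDisallowed(sampleRobots, '/private/data')); // true
console.log(isDisallowed(sampleRobots, '/nemo'));         // false
```

For real use, a maintained robots.txt parser is a safer choice than this sketch, since the full rules include precedence between `Allow` and `Disallow` and pattern matching.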
MIT