How can node-scrapy be used recursively to crawl a site? #23
-
Hi, would you please add a recursive example? Thanks!
Answered by stefanmaric on Dec 14, 2020
Replies: 1 comment
-
Hi @mariusa,

Even though we use node-scrapy for crawling at Eeshi, it is focused on the scraping part. It doesn't provide anything at the network layer, and even less for the crawling logic. For HTTP fetching there is a plethora of options (request, got, axios, node-fetch, etc.). Here's a quick example I put together with node-fetch:

```js
const fs = require('fs')
const path = require('path')
// need to be installed in the project
const fetch = require('node-fetch')
const { extract } = require('node-scrapy')
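// Random delay of up to ~10 seconds, used below to space out requests.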
const wait = () =>
new Promise((resolve) => {
setTimeout(resolve, Math.round(Math.random() * 10000))
})
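// Maps each crawled URL to the links found on that page; also doubles as the visited set.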
const LINKS_STORE = {}
const START_URL = 'https://en.wikipedia.org/wiki/Printmaking'
const crawl = async (url) => {
console.log(`Fetching: ${url}`)
  // Don't choke Wikipedia's servers. This random wait should eventually be replaced
  // by an actual queue with a parallel limit (see the sketch after this example).
await wait()
const response = await fetch(url, {
headers: {
      // Wikipedia requires a User-Agent header; otherwise it blocks requests right away.
'User-Agent':
'Mariusa/1.0 (http://mariusa.github.io/crawler/; crawler@mariusa.github.io) used-base-library/1.0',
},
})
if (!response.ok) {
console.log(`Failed to fetch: ${url}`)
console.dir(response)
return
}
const body = await response.text()
const links = extract(body, [
// Get links only from the right-hand sidebars, two kinds of it
`.infobox a[href^="/wiki/"], .vertical-navbox a[href^="/wiki/"] (href | normalizeWhitespace | trim | prefix:"https://en.wikipedia.org")`,
])
if (!links) {
console.log(`No links found in page: ${url}`)
return
}
console.log(`${links.length} links found at: ${url}`)
LINKS_STORE[url] = links
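  // Recurse into every link that hasn't been crawled yet (fire-and-forget, not awaited).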
for (let link of links) {
if (link in LINKS_STORE) {
      console.log(`Skipping URL because it was crawled already: ${link}`)
} else {
crawl(link)
}
}
}
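// Write the collected links to disk exactly once, even if more than one exit handler fires.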
const writeResults = (() => {
let writing
return () => {
if (writing) {
return
}
writing = true
const filename = path.join(__dirname, 'result.json')
fs.writeFileSync(filename, JSON.stringify(LINKS_STORE, null, 2), 'utf-8')
console.log(`Results saved to ${filename}`)
process.exit(process.exitCode)
}
})()
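// Persist results when the process ends normally or is interrupted with Ctrl+C.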
process.on('exit', writeResults)
process.on('SIGINT', writeResults)
crawl(START_URL)
```

Hope this helps you.
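The comment inside crawl() notes that the random wait should eventually be replaced by an actual queue with a parallel limit. Below is a minimal sketch of one way to do that with a hand-rolled queue; the enqueue/drain names and the concurrency of 2 are arbitrary choices for illustration, and crawl() / START_URL are assumed to come from the example above (with the random wait removed).

```js
// Minimal sketch: a crawl queue with a parallel limit (illustrative names and values).
const MAX_PARALLEL = 2 // arbitrary concurrency limit
const queue = []
let active = 0

const drain = () => {
  while (active < MAX_PARALLEL && queue.length > 0) {
    const url = queue.shift()
    active += 1
    // Reuse crawl() from the example above (with the random wait() removed).
    crawl(url)
      .catch((error) => console.error(`Crawl failed for ${url}:`, error))
      .finally(() => {
        active -= 1
        drain()
      })
  }
}

const enqueue = (url) => {
  queue.push(url)
  drain()
}

// Inside crawl(), replace the direct recursive call crawl(link) with enqueue(link),
// then start the whole thing with:
enqueue(START_URL)
```

Checking LINKS_STORE at enqueue time (before pushing) would also avoid fetching the same page twice while another crawl of it is still in flight.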
Answer selected by mariusa