The Rust ecosystem currently has a rich variety of high-quality parsers, thanks in large part to foundations like serde. Crates already exist for parsing both robots.txt and sitemap.xml, so the next logical step in rolling your own web spider in Rust is combining the two to take advantage of all the owner-supplied metadata a site exposes.
Don't, yet.
The current state of rs-pider is quite rough. It's a single structure, meta::SiteMeta, which recursively fetches every sitemap.xml file discoverable on a given domain, using the robots.txt file as the entry point.
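To make that discovery step concrete, here is a minimal sketch of its first half: pulling the Sitemap: directives out of a robots.txt body so they can be queued for recursive fetching. This is an illustration only, not rs-pider's API; sitemap_urls is a hypothetical helper, and the HTTP, XML parsing, and recursion are left out.

```rust
// Illustrative sketch, not rs-pider's API: given the body of a robots.txt
// file, collect the Sitemap: directives so they can be queued for fetching.
fn sitemap_urls(robots_txt: &str) -> Vec<String> {
    robots_txt
        .lines()
        .filter_map(|line| {
            // Directives are "key: value"; the Sitemap key is case-insensitive.
            let (key, value) = line.split_once(':')?;
            key.trim()
                .eq_ignore_ascii_case("sitemap")
                .then(|| value.trim().to_string())
        })
        .collect()
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n";
    assert_eq!(sitemap_urls(robots), vec!["https://example.com/sitemap.xml"]);
}
```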
(a) Add filters for both sitemap.xml paths and the URLs passed out (make a UrlFilter module; see the sketch after this list)
(z) Respect robots.txt exclusions (see the Disallow sketch after this list)
(b) Think about how long-running spiders will reintegrate stale, already-fetched, or failed URLs
(b) Add conditional fetch based on HEAD hashes
(c) Split off and generalize the segmented data structure for later use with the UrlEntry type
Hes: (z) (c)
Odi: (a) (b)
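For item (a), the sketch below shows one shape a UrlFilter module could take: a small set of predicate rules that every URL, whether a sitemap path or an extracted page URL, must pass. None of these names exist in rs-pider yet; they are assumptions for illustration.

```rust
// Hypothetical UrlFilter for item (a); the type and methods are illustrative,
// not part of rs-pider today.
pub struct UrlFilter {
    // Each rule returns true if the URL should be kept.
    rules: Vec<Box<dyn Fn(&str) -> bool>>,
}

impl UrlFilter {
    pub fn new() -> Self {
        Self { rules: Vec::new() }
    }

    // Builder-style: add a predicate and return the filter.
    pub fn rule<F>(mut self, f: F) -> Self
    where
        F: Fn(&str) -> bool + 'static,
    {
        self.rules.push(Box::new(f));
        self
    }

    // A URL passes only if every rule accepts it.
    pub fn allows(&self, url: &str) -> bool {
        self.rules.iter().all(|rule| rule(url))
    }
}

fn main() {
    let filter = UrlFilter::new()
        .rule(|url| url.ends_with(".xml") || url.ends_with(".html"))
        .rule(|url| !url.contains("/private/"));

    assert!(filter.allows("https://example.com/sitemap.xml"));
    assert!(!filter.allows("https://example.com/private/sitemap.xml"));
}
```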
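For item (z), honouring robots.txt exclusions could start as simply as collecting the Disallow prefixes and prefix-matching paths against them, as sketched below. Real robots.txt semantics (per-user-agent groups, Allow precedence, wildcards) are more involved, so this is illustrative only; leaning on an existing robots.txt crate is probably the better long-term answer.

```rust
// Illustrative only: gather non-empty Disallow values, ignoring user-agent
// grouping and Allow rules entirely.
fn disallowed_prefixes(robots_txt: &str) -> Vec<String> {
    robots_txt
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            (key.trim().eq_ignore_ascii_case("disallow") && !value.trim().is_empty())
                .then(|| value.trim().to_string())
        })
        .collect()
}

// A path is excluded if it starts with any disallowed prefix.
fn is_excluded(prefixes: &[String], path: &str) -> bool {
    prefixes.iter().any(|p| path.starts_with(p))
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private\nDisallow: /tmp/\n";
    let prefixes = disallowed_prefixes(robots);
    assert!(is_excluded(&prefixes, "/private/report.html"));
    assert!(!is_excluded(&prefixes, "/blog/post-1"));
}
```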