Releases: spider-rs/spider
v1.80.15
What's Changed
- feat(depth): add crawl depth level control
- feat(redirect): expose a redirect limit that respects server redirects
- feat(redirect): add redirect policy Loose & Strict
- perf(control): add rwlock crawl control
Example:
extern crate spider;

use spider::{tokio, website::Website, configuration::RedirectPolicy};
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_depth(3)
        .with_redirect_limit(4)
        .with_redirect_policy(RedirectPolicy::Strict)
        .build()
        .unwrap();

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {:?}", links.len());

    Ok(())
}
Full Changelog: v1.80.3...v1.80.15
v1.80.3
What's Changed
- feat(cache): add caching backend feat flag by @j-mendez in #156
- chore(chrome_intercept): fix intercept redirect initial domain
- perf(chrome_intercept): improve intercept handling of assets
Example:
Make sure to have the feature flag [cache] enabled. Storing the cache in memory can be done with the [cache_mem] flag instead of using disk space.
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // use the builder method to enable caching, or set `website.cache` to true directly.
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_caching(true)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());

    // the next run of website.crawl().await will be faster since the content is stored on disk.
}
Full Changelog: v1.70.4...v1.80.3
v1.71.5
What's Changed
Spider (Core)
Request interception can be done by enabling the [chrome_intercept] feature flag and setting website.chrome_intercept. This blocks all resources that are not related to the domain, speeding up requests when using Chrome.
Ex:
//! `cargo run --example chrome --features chrome_intercept`
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let block_images = true;

    let mut website: Website = Website::new("https://rsseau.fr")
        .with_chrome_intercept(true, block_images)
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}
CLI
Request interception can be done by using the block_images arg and enabling the [chrome_intercept] feature flag.
Ex: --block_images
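A full invocation might look like the following; the install command and the exact flag placement are assumptions, only the --block_images arg and the chrome_intercept feature flag are confirmed above:
cargo install spider_cli --features chrome_intercept
spider --url https://rsseau.fr --block_images crawl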
Full Changelog: v1.60.12...v1.70.5
v1.60.13
What's Changed
This release brings a new feature flag (smart), performance improvements, and fixes.
- feat(smart): add smart feature flag for smart mode; requests default to HTTP until JavaScript rendering is needed
- perf(crawl): add clone external checking
- chore(chrome): fix chrome connection socket keep alive on remote connections
- feat(chrome_store_page): add feat flag chrome_store_page and screenshot helper
- chore(decentralize): fix glob build
- feat(redirect): add transparent top redirect handling
Smart Mode
Smart mode brings the best of both worlds when crawling: it runs plain HTTP requests first and only switches to Chrome when JavaScript page rendering is required.
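A minimal sketch of a smart-mode crawl, assuming the smart feature flag is enabled in Cargo.toml and that crawl() picks up the smart handling automatically once the flag is on:
[dependencies]
spider = { version = "1.60.13", features = ["smart"] }
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // with the `smart` flag the crawl defaults to plain HTTP requests
    // and only uses Chrome when JavaScript rendering is required.
    let mut website: Website = Website::new("https://rsseau.fr");

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}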
Screenshots
Taking a screenshot manually can be done with the [chrome_store_page] feature flag.
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("Screenshotting: {:?}", page.get_url());
            let full_page = false;
            let omit_background = true;
            page.screenshot(full_page, omit_background).await;
            // output is stored by default to ./storage/; use the env variable SCREENSHOT_DIRECTORY to adjust the path.
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}
Full Changelog: v1.50.20...v1.60.13
v1.50.20
What's Changed
- feat(chrome): add chrome_screenshot feature flag (see the sketch below)
- chore(control): fix control task abort after crawl
- chore(website): add website.stop handling for shutdown
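A rough sketch of enabling the new flag, assuming chrome_screenshot is used alongside the chrome feature flag and stores a screenshot of each page visited during the crawl (both points are assumptions, not stated above):
[dependencies]
spider = { version = "1.50.20", features = ["chrome", "chrome_screenshot"] }
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // screenshots are captured as pages are crawled when the flag is enabled.
    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}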
Full Changelog: v1.50.2...v1.50.20
v1.50.5
What's Changed
You can now run a cron job at any time to sync data from the crawls. Use the cron with subscribe to handle data curation with ease.
- feat(cron): add cron feature flag by @j-mendez in #153
- chore(tls): add optional native tls
- feat(napi): add napi support for nodejs
[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }
extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron to run, or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // run_cron takes ownership of the website. You can also use website.run_cron,
    // but then you need to abort the handles it creates manually.
    let runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
Full Changelog: v1.49.10...v1.50.5
v1.49.12
What's Changed
- feat(cookies): add cookie jar optional feature
You can set a cookie String directly with website.cookie_str that is added to each request. Using the cookie feature also enables storing cookies that are received.
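A minimal sketch, assuming the feature flag is named cookies (matching the commit scope) and that cookie_str is a plain String field; the cookie value is only illustrative:
[dependencies]
spider = { version = "1.49.12", features = ["cookies"] }
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // sent with every request; received cookies are also stored when the feature is enabled.
    website.cookie_str = "sessionid=abc123".into();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}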
Full Changelog: v1.49.10...v1.49.12
v1.49.10
What's Changed
- chore(chrome): fix chrome headless headful args
- chore(cli): add http check cli website url
- chore(cli): rename domain arg to url [#150]
- chore(cli): add invalid website error log
- Return status code on error by @marlonbaeten in #151
- chore(chrome): add main chromiumoxide crate - ( fork changes merged to the main repo )
- chore(chrome): fix headful browser open
- chore(website): add crawl_concurrent_raw method by @j-mendez in #152
- chore(deps): bump tokio@1.34.0
Thank you @marlonbaeten for the help!
Full Changelog: v1.48.0...v1.49.10