Releases: spider-rs/spider
v1.98.4
What's Changed
You can now set a maximum byte-size limit for resources using the env variable SPIDER_MAX_SIZE_BYTES.
Example for a 1 GB limit:
export SPIDER_MAX_SIZE_BYTES=1073741824
Full Changelog: v1.98.3...v1.98.4
v1.98.3
What's Changed
- Fix sitemap regex compile.
- Whitelisting routes to only get the paths you want can now be done with website.with_whitelist_url.
- Fix robot disallow respect.
Example:
use spider::{tokio, website::Website};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");
    // Only crawl paths that match the whitelist.
    website.with_whitelist_url(Some(vec!["/books".into()]));

    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();

    // Print each crawled URL as it arrives on the subscription channel.
    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let _ = stdout
                .write_all(format!("- {}\n", res.get_url()).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    website.unsubscribe();
    let duration = start.elapsed();

    let mut stdout = join_handle.await.unwrap();
    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}
Full Changelog: v1.97.14...v1.98.3
v1.97.14
What's Changed
Fix issue with invalid Chrome User-Agents when spoofing. If you are running spider like a job across many websites, use website.with_shared_queue to make the workload fair across all of them.
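A minimal sketch of that shared-queue setup (not taken from the release notes): the with_shared_queue name comes from this release, but the boolean toggle and the second example URL are assumptions here.

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut site_a: Website = Website::new("https://rsseau.fr/en/");
    let mut site_b: Website = Website::new("https://example.com");
    // Assumed boolean toggle: share one work queue so concurrent crawls
    // of several websites make fair progress.
    site_a.with_shared_queue(true);
    site_b.with_shared_queue(true);
    // Crawl both sites concurrently as a single job.
    tokio::join!(site_a.crawl(), site_b.crawl());
    println!(
        "site_a pages: {}, site_b pages: {}",
        site_a.get_links().len(),
        site_b.get_links().len()
    );
}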
- chore(chrome): fix non chrome agents spoofing
- feat(sem): add shared queue strategy
Full Changelog: v1.97.12...v1.97.14
v1.97.12
What's Changed
- add scoped website semaphore
- add [cowboy] flag to remove semaphore limiting 🤠
- remove budget feature flag
- fix accidental chrome_intercept type injection compile error
- chore(cli): fix params builder optional handling
- chore(page): add invalid url handling
- chore(website): fix type blacklist compile
Full Changelog: v1.95.25...v1.97.12
v1.96.0
What's Changed
Fix Chrome stealth user-agent handling.
- chore(website): fix chrome stealth handling agent
- chore(website): add safe semaphore handling
Full Changelog: v1.95.25...v1.96.0
v1.95.28
What's Changed
The website crawl status now returns the proper state without resetting.
- chore(website): fix crawl status persisting
Full Changelog: v1.95.25...v1.95.28
v1.95.27
What's Changed
This release provides a major fix for crawls being delayed by robots.txt respect or crawl delays. If you set a limit or budget for the crawl and a robots.txt contains a 10s delay, that delay would bottleneck the entire crawl when limits applied, since we had to wait for each link to process before exiting. The robots delay is now capped at 60s for efficiency.
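A minimal sketch of the limit-plus-robots combination described above, assuming spider's builder-style setters with_respect_robots_txt and with_limit (the page limit value is only illustrative):

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");
    // Respect robots.txt; any crawl-delay it declares is now capped at 60s.
    website.with_respect_robots_txt(true);
    // Assumed setter: cap the crawl so a long robots delay cannot stall the exit.
    website.with_limit(50);
    website.crawl().await;
    println!("pages: {}", website.get_links().len());
}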
- chore(cli): fix limit respecting
- chore(robots): fix respect robots [#184]
- bump chromiumoxide@0.6.0
- bump tiktoken-rs@0.5.9
- bump hashbrown@0.14.5
- add zstd support reqwest
- unpin smallvec
- chore(website): fix crawl limit immediate exit
- chore(robots): add max delay respect
Full Changelog: v1.95.6...v1.95.27
v1.95.9
What's Changed
- chore(openai): fix smart mode passing target url
- chore(js): remove alpha js feature flag - jsdom crate
- chore(chrome): remove unnecessary page activation
- chore(openai): compress base prompt
- chore(openai): remove hidden content from request
Full Changelog: v1.94.4...v1.95.9
v1.94.4
What's Changed
A hybrid cache between Chrome CDP requests and HTTP requests can be enabled with the cache_chrome_hybrid feature flag.
You can simulate browser HTTP headers to help increase the chance of a plain HTTP request succeeding with the real_browser flag.
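A sketch of enabling both flags in Cargo.toml; the feature names come from this release, while the version pin is only illustrative:

[dependencies]
spider = { version = "1.94.4", features = ["cache_chrome_hybrid", "real_browser"] }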
- feat(cache): add chrome caching between http
- feat(real_browser): add http simulation headers
Full Changelog: v1.93.43...v1.94.4
v1.93.43
What's Changed
Generating random real user-agents can now be done using ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.
Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer, followed by website.with_user_agent.
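A minimal sketch combining the two; the assumptions here are spoof_ua() as ua_generator's one-shot helper and an Option<&str> signature for with_user_agent:

use spider::{tokio, website::Website};
use ua_generator::ua::spoof_ua;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");
    // Pick a random real User-Agent and apply it before crawling.
    website.with_user_agent(Some(spoof_ua()));
    website.crawl().await;
    println!("pages: {}", website.get_links().len());
}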
- feat(spoof): add referrer spoofing
- feat(spoof): add real user-agent spoofing
- feat(chrome): add dynamic chrome connections
Full Changelog: v1.93.23...v1.93.43