Releases: spider-rs/spider
v2.10.6
What's Changed
- add HTML lang auto-encoding handling to improve detection
- add `exclude_selector` and `root_selector` transformations output formats
- add bin file handling to prevent SOF transformations
- chore(chrome): fix window navigator stealth handling
- chore: fix subdomains and TLD handling
- chore(chrome): add automation all routes handling
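The bin-file guard can be sketched conceptually: skip transformation when the input looks binary rather than textual. This is a naive stdlib-only illustration (`looks_binary` is a hypothetical helper, not spider's actual implementation, which may also check magic bytes and UTF-8 validity):

```rust
/// Naive binary sniff: content with NUL bytes in its prefix is treated as
/// binary and skipped, since feeding it to an HTML transformer is wasted work
/// (and in the worst case can blow the stack on deeply-nested garbage).
fn looks_binary(data: &[u8]) -> bool {
    data.iter().take(8192).any(|&b| b == 0)
}
```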
Full Changelog: v2.9.15...v2.10.6
v2.9.15
What's Changed
- add XPath data extraction support to `spider_utils`
- add XML return format for `spider_transformations`
- chore(transformations): add root selector across formats #219
Example extracting data via XPath:

```rust
use spider_utils::{
    build_selectors, css_query_select_map_streamed, QueryCSSMap, QueryCSSSelectSet,
};

let map = QueryCSSMap::from([(
    "list",
    QueryCSSSelectSet::from(["//*[@class='list']"]),
)]);
let data = css_query_select_map_streamed(
    r#"<html><body><ul class="list"><li>Test</li></ul></body></html>"#,
    &build_selectors(map),
)
.await;

assert!(!data.is_empty(), "XPath extraction failed");
```
Full Changelog: v2.8.28...v2.9.15
v2.8.29
What's Changed
Fixed request interception for remote connections. The intercept builder now uses `spider::features::chrome_common::RequestInterceptConfiguration` and adds more control.
- chrome performance improvement reducing duplicate events
- chore(chrome): add set extra headers
- chore(smart): add http fallback chrome smart mode request
- chore(chrome): add spoofed plugins
- chore(real-browser): add mouse movement waf
- chore(chrome): patch logs stealth mode
- chore(page): fix url join empty slash
- chore(chrome): fix return page response headers and cookies
- chore(page): add empty page validation
- chore(config): add serializable crawl configuration
- chore(retry): add check 502 notfound retry
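The empty-slash join fix in the list above can be illustrated conceptually. This is a stdlib-only sketch (`join_url` is a hypothetical helper, not spider's code): joining a base URL with an empty or `/` path should not produce doubled or missing slashes.

```rust
/// Join a base URL with a path, normalizing slashes so that an empty path or
/// a bare "/" does not produce "https://example.com//" or a missing separator.
fn join_url(base: &str, path: &str) -> String {
    let base = base.trim_end_matches('/');
    let path = path.trim_start_matches('/');
    if path.is_empty() {
        format!("{}/", base)
    } else {
        format!("{}/{}", base, path)
    }
}
```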
Full Changelog: v2.7.1...v2.8.29
v2.7.1
What's Changed
- add chrome remote connection proxy ability.
- add context handling and disposing chrome.
- chore(chrome): fix concurrent pages opening remote ws connections
- chore(chrome): add cookie setting browser
- chore(chrome): fix connecting to chrome when using a LB
- feat(website): add retry and rate limiting handling
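Retry handling typically follows a capped exponential-backoff schedule between attempts. A stdlib-only sketch of the idea (illustrative only; spider's retry behavior is configured on the `Website` builder and its internals may differ):

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
/// Saturating math and a shift cap avoid overflow on large attempt counts.
fn backoff(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(cap_ms))
}
```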
Full Changelog: v2.6.15...v2.7.1
v2.6.15
- fix parsing links for top-level redirected domains
- add `website.with_preserve_host_header`
- default TLS to `reqwest_native_tls_native_roots`
Full Changelog: v2.5.2...v2.6.15
HTML Transformations
What's Changed
We open-sourced the transformation utilities behind Spider Cloud, which provide high-performance output to markdown, text, and other formats.
You can install `spider_transformations` on its own, or enable the `transformations` feature flag when installing `spider_utils`.
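The two install options map to `Cargo.toml` entries along these lines (version numbers here are illustrative, not pinned by this release):

```toml
[dependencies]
# Option 1: the transformation crate on its own.
spider_transformations = "2"

# Option 2: via spider_utils with the feature flag.
spider_utils = { version = "2", features = ["transformations"] }
```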
```rust
use spider::tokio;
use spider::website::Website;
use spider_utils::spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();
    let mut conf = TransformConfig::default();
    conf.return_format = ReturnFormat::Markdown;

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let markup = transform_content(&res, &conf, &None, &None);
            let _ = stdout
                .write_all(format!("- {}\n {}\n", res.get_url(), markup).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    website.unsubscribe();
    let duration = start.elapsed();

    let mut stdout = join_handle.await.unwrap();
    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}
```
Full Changelog: v2.5.2...v2.6.2
v2.5.3
v2.4.1
What's Changed
Screenshot performance has drastically increased by taking advantage of Chrome's params to handle `full_screen` without re-adjusting the layout, together with the `optimize_for_speed` param. This works well with the concurrent interception handling to avoid stalling on re-layout. If you use the crawler to take screenshots, upgrading is recommended.
- perf(chrome): add major screenshot performance custom command
- chore(utils): add trie match all base path
- chore(examples): add css scraping example
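The trie match-all-base-path utility in the list above can be sketched conceptually: store base paths segment by segment, then match any URL path that falls under an inserted base. This is a stdlib-only illustration, not spider's internal trie:

```rust
use std::collections::HashMap;

/// A trie over path segments: inserting "/docs" makes every path under
/// "/docs" match, without storing each sub-path individually.
#[derive(Default)]
struct PathTrie {
    children: HashMap<String, PathTrie>,
    terminal: bool,
}

impl PathTrie {
    /// Insert a base path, marking its final segment as a terminal node.
    fn insert(&mut self, path: &str) {
        let mut node = self;
        for seg in path.split('/').filter(|s| !s.is_empty()) {
            node = node.children.entry(seg.to_string()).or_default();
        }
        node.terminal = true;
    }

    /// True if any inserted base path is a prefix of `path`.
    fn matches(&self, path: &str) -> bool {
        let mut node = self;
        if node.terminal {
            return true;
        }
        for seg in path.split('/').filter(|s| !s.is_empty()) {
            match node.children.get(seg) {
                Some(next) => {
                    node = next;
                    if node.terminal {
                        return true;
                    }
                }
                None => return false,
            }
        }
        false
    }
}
```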
Full Changelog: v2.3.5...v2.4.1
v2.3.5
What's Changed
Major performance improvement on chrome enabling concurrent request interception for resource heavy pages.
- add response headers when chrome is used
- add hybrid cache response and headers chrome
- fix chrome sub pages setup
- perf(chrome): add concurrent request interception
Full Changelog: v2.2.18...v2.3.5
v2.2.18
What's Changed
We can now auto-detect locales without losing out on performance. The `encoding` flag is now enabled by default for this change!
- `get_html` now properly encodes the HTML instead of defaulting to UTF-8 encoding
- bump chromiumoxide@0.7.0
- fix chrome hang on ws connections handler
- fix fetch stream infinite loop on error
- fix chrome frame setting url ( this temp prevents hybrid caching from having the req/res for the page )
```rust
let mut website: Website = Website::new("https://tenki.jp");
// all of the content output has the proper encoding automatically
```
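The encoding auto-detection described above can be sketched conceptually: peek at the document for a declared charset before decoding. This naive stdlib-only sniff (`sniff_charset` is a hypothetical helper) only reads a `charset=` declaration; real detection, as in spider, also considers BOMs, HTTP headers, and statistical analysis:

```rust
/// Extract a declared charset (e.g. from `<meta charset="shift_jis">` or a
/// `content="text/html; charset=euc-jp"` attribute) from raw HTML, if any.
fn sniff_charset(html: &str) -> Option<String> {
    let lower = html.to_ascii_lowercase();
    let idx = lower.find("charset=")?;
    let rest = &lower[idx + "charset=".len()..];
    let value: String = rest
        .trim_start_matches(|c| c == '"' || c == '\'')
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '-' || *c == '_')
        .collect();
    if value.is_empty() { None } else { Some(value) }
}
```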
Full Changelog: v2.1.9...v2.2.18