Releases: spider-rs/spider

v2.10.6

22 Oct 10:47

What's Changed

  1. add HTML lang auto-encoding handling to improve detection
  2. add exclude_selector and root_selector to the transformation output formats
  3. add bin file handling to prevent SOF transformations
  4. chore(chrome): fix window navigator stealth handling
  5. chore: fix subdomains and tld handling
  6. chore(chrome): add automation all routes handling

Full Changelog: v2.9.15...v2.10.6

v2.9.15

09 Oct 13:50

What's Changed

  • add XPath data extraction support to spider_utils
  • add XML return format for spider_transformations
  • chore(transformations): add root selector across formats #219

Example extracting data via XPath (the imports below assume these helpers are exported from the spider_utils crate root):

    use spider_utils::{
        build_selectors, css_query_select_map_streamed, QueryCSSMap, QueryCSSSelectSet,
    };

    // Build a named selector group; XPath expressions work alongside CSS selectors.
    let map = QueryCSSMap::from([(
        "list",
        QueryCSSSelectSet::from(["//*[@class='list']"]),
    )]);
    let data = css_query_select_map_streamed(
        r#"<html><body><ul class="list"><li>Test</li></ul></body></html>"#,
        &build_selectors(map),
    )
    .await;

    assert!(!data.is_empty(), "XPath extraction failed");

Full Changelog: v2.8.28...v2.9.15

v2.8.29

05 Oct 14:55

What's Changed

Fixed request interception for remote connections. The intercept builder now uses spider::features::chrome_common::RequestInterceptConfiguration and adds more control.
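
A minimal sketch of enabling interception with the new configuration struct; the with_chrome_intercept builder call and the new(true) constructor are assumptions based on this note, so check the crate docs for the exact API:

    use spider::features::chrome_common::RequestInterceptConfiguration;
    use spider::website::Website;

    // Assumed usage: enable interception via the configuration struct
    // (the crate's chrome intercept feature is assumed to be enabled).
    let mut website: Website = Website::new("https://example.com")
        .with_chrome_intercept(RequestInterceptConfiguration::new(true))
        .build()
        .unwrap();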

  • chrome performance improvement reducing dup events
  • chore(chrome): add set extra headers
  • chore(smart): add http fallback chrome smart mode request
  • chore(chrome): add spoofed plugins
  • chore(real-browser): add mouse movement waf
  • chore(chrome): patch logs stealth mode
  • chore(page): fix url join empty slash
  • chore(chrome): fix return page response headers and cookies
  • chore(page): add empty page validation
  • chore(config): add serializable crawl configuration
  • chore(retry): add check 502 notfound retry

Full Changelog: v2.7.1...v2.8.29

v2.7.1

30 Sep 23:14

What's Changed

  • add chrome remote connection proxy ability
  • add context handling and disposal for chrome
  • chore(chrome): fix concurrent pages opening remote ws connections
  • chore(chrome): add cookie setting browser
  • chore(chrome): fix connecting to chrome when using a LB
  • feat(website): add retry and rate limiting handling

Full Changelog: v2.6.15...v2.7.1

v2.6.15

22 Sep 00:27

What's Changed

  • fix parsing links for top level redirected domains
  • add website.with_preserve_host_header (see the sketch after this list)
  • default tls reqwest_native_tls_native_roots
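
A minimal sketch of the new option; the boolean parameter and usage are assumptions (the note only names the method), so check the crate docs for the exact signature:

    use spider::website::Website;

    let mut website: Website = Website::new("https://example.com");
    // Assumed boolean toggle for preserving the host header on requests.
    website.with_preserve_host_header(true);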

Full Changelog: v2.5.2...v2.6.15

HTML Transformations

21 Sep 12:29

What's Changed

We open sourced our transformation utilities for Spider Cloud, which provide high-performance output to markdown, text, and other formats.

You can install spider_transformations on its own or enable the transformations feature flag when installing spider_utils.

use spider::tokio;
use spider::website::Website;
use spider_utils::spider_transformations::transformation::content::{
    transform_content, ReturnFormat, TransformConfig,
};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
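    // Subscribe before crawling so each page is received as it is processed.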
    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();

    let mut conf = TransformConfig::default();
    conf.return_format = ReturnFormat::Markdown;

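    // Transform each received page to markdown and stream it to stdout.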
    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let markup = transform_content(&res, &conf, &None, &None);

            let _ = stdout
                .write_all(format!("- {}\n {}\n", res.get_url(), markup).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    website.unsubscribe();
    let duration = start.elapsed();
    let mut stdout = join_handle.await.unwrap();

    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}

Full Changelog: v2.5.2...v2.6.2

v2.5.3

14 Sep 17:08

What's Changed

  1. Add string interning for visited links, improving lookup performance and reducing memory use for the link store. #204

Full Changelog: v2.4.1...v2.5.3

v2.4.1

09 Sep 16:39

What's Changed

Screenshot performance has drastically increased by taking advantage of Chrome's params to handle full_screen without re-adjusting the layout and the optimize_for_speed param. This works well with the concurrent interception handling to avoid stalling on re-layout. If you use the crawler to take screenshots, upgrading is recommended.

  • perf(chrome): add major screenshot performance custom command
  • chore(utils): add trie match all base path
  • chore(examples): add css scraping example

Full Changelog: v2.3.5...v2.4.1

v2.3.5

08 Sep 07:20

What's Changed

Major performance improvement for chrome by enabling concurrent request interception on resource-heavy pages.

  • add response headers when chrome is used
  • add hybrid cache response and headers chrome
  • fix chrome sub pages setup
  • perf(chrome): add concurrent request interception

Full Changelog: v2.2.18...v2.3.5

v2.2.18

29 Aug 02:03

What's Changed

We can now auto-detect locales without losing performance. The encoding flag is now enabled by default for this change!

  • get_html now properly encodes the HTML instead of defaulting to UTF-8 encoding
  • bump chromiumoxide@0.7.0
  • fix chrome hang on ws connections handler
  • fix fetch stream infinite loop on error
  • fix chrome frame setting url (this temporarily prevents hybrid caching from having the req/res for the page)

    use spider::website::Website;

    let mut website: Website = Website::new("https://tenki.jp");
    // all of the content output has the proper encoding automatically

Full Changelog: v2.1.9...v2.2.18