Releases: spider-rs/spider

v1.80.15

19 Dec 20:15

What's Changed

  • feat(depth): add crawl depth level control
  • feat(redirect): expose redirect limit and respect server redirects
  • feat(redirect): add redirect policy Loose & Strict
  • perf(control): add rwlock crawl control

Example:

extern crate spider;

use spider::{tokio, website::Website, configuration::RedirectPolicy};
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_depth(3)
        .with_redirect_limit(4)
        .with_redirect_policy(RedirectPolicy::Strict)
        .build()
        .unwrap();

    website.crawl().await;

    let links = website.get_links();

    // iterate without consuming the set so the total can still be printed below.
    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {:?}", links.len());

    Ok(())
}

Full Changelog: v1.80.3...v1.80.15

v1.80.3

16 Dec 18:00

What's Changed

  • feat(cache): add caching backend feat flag by @j-mendez in #156
  • chore(chrome_intercept): fix intercept redirect initial domain
  • perf(chrome_intercept): improve intercept handling of assets

Example:

Make sure the feat flag [cache] is enabled. The cache can be stored in memory with the [cache_mem] flag instead of using disk space.
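
For reference, a minimal dependency entry with the cache flag might look like the following (swap in cache_mem for in-memory storage):

[dependencies]
spider = { version = "1.80.3", features = ["cache"] }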

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // we can use the builder method to enable caching or set `website.cache` to true directly.
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_caching(true)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
    // the next run of website.crawl().await will be faster since the content is cached on disk.
}

Full Changelog: v1.70.4...v1.80.3

v1.71.5

15 Dec 00:28

What's Changed

Spider(Core)

Request interception can be done by enabling [chrome_intercept] and setting website.chrome_intercept. This blocks all resources that are not related to the domain, speeding up requests when using Chrome.

Ex:

//! `cargo run --example chrome --features chrome_intercept`
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let block_images = true;
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_chrome_intercept(true, block_images)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}

CLI

Request interception can be done by passing the block_images arg and enabling the [chrome_intercept] feature flag.

Ex: --block_images
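
A full invocation might look like the following, assuming the crawl subcommand and the renamed url arg, with the binary built using the chrome_intercept feature:

spider --url https://rsseau.fr --block_images crawl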

Full Changelog: v1.60.12...v1.70.5

v1.60.13

12 Dec 16:42

What's Changed

This release brings a new feature flag (smart), performance improvements, and fixes.

  • feat(smart): add the smart feature flag for smart mode. Requests default to HTTP until JavaScript rendering is needed
  • perf(crawl): add clone external checking
  • chore(chrome): fix chrome connection socket keep alive on remote connections
  • feat(chrome_store_page): add feat flag chrome_store_page and screenshot helper
  • chore(decentralize): fix glob build
  • feat(redirect): add transparent top redirect handling

Smart Mode

Smart mode brings the best of both worlds when crawling. It runs plain HTTP requests first and only switches to Chrome when JavaScript rendering is required.
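
A minimal sketch, assuming the smart feature flag is compiled in and the crawl_smart helper it gates:

//! requires building with the `smart` feature flag
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // with the smart feature, pages are fetched over plain HTTP first and only
    // rendered with Chrome when JavaScript execution is required.
    website.crawl_smart().await;

    println!("Links found {:?}", website.get_links().len());
}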

Screenshots

Taking a screenshot manually can be done with the [chrome_store_page] feature flag.

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("Screenshotting: {:?}", page.get_url());
            let full_page = false;
            let omit_background = true;
            page.screenshot(full_page, omit_background).await;
            // output is stored by default to ./storage/ use the env variable SCREENSHOT_DIRECTORY to adjust the path.
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}

Full Changelog: v1.50.20...v1.60.13

v1.50.20

04 Dec 18:11

What's Changed

  • feat(chrome): add chrome_screenshot feature flag
  • chore(control): fix control task abort after crawl
  • chore(website): add website.stop handling shutdown

Full Changelog: v1.50.2...v1.50.20

v1.50.5

25 Nov 15:19

What's Changed

You can now run a cron job at any time to sync data from crawls. Use cron together with subscribe to handle data curation with ease.

  • feat(cron): add cron feature flag by @j-mendez in #153
  • chore(tls): add optional native tls
  • feat(napi): add napi support for nodejs

Example:

[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }

extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron to run or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // take ownership of the website. You can also use website.run_cron, but then you need to abort the spawned handles manually.
    let runner = run_cron(website).await;
    
    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}

Full Changelog: v1.49.10...v1.50.5

v1.49.12

24 Nov 21:31

What's Changed

  • feat(cookies): add cookie jar optional feature

You can set a cookie String directly with website.cookie_str; it is added to each request. Using the cookie feature also enables storing cookies that are received.
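
A minimal sketch, assuming the cookie feature is enabled and that cookie_str accepts a plain String (the cookie value here is only a placeholder):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // sent with every request made during the crawl; the value is a placeholder.
    website.cookie_str = "sessionid=abc123".into();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}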

Full Changelog: v1.49.10...v1.49.12

v1.49.10

20 Nov 17:45

What's Changed

  • chore(chrome): fix chrome headless headful args
  • chore(cli): add http check cli website url
  • chore(cli): rename domain arg to url [#150]
  • chore(cli): add invalid website error log
  • Return status code on error by @marlonbaeten in #151
  • chore(chrome): add main chromiumoxide crate - ( fork changes merged to the main repo )
  • chore(chrome): fix headful browser open
  • chore(website): add crawl_concurrent_raw method by @j-mendez in #152
  • chore(deps): bump tokio@1.34.0

Thank you @marlonbaeten for the help!

Full Changelog: v1.48.0...v1.49.10

v1.48.0

13 Nov 16:16

What's Changed

  • feat(page): add status code and error message page response by @j-mendez in #148 (see the sketch below)
  • chore(scraper): ignore scripts and styles when extracting text from nodes
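
A short sketch of reading the new response data from a subscription; the status_code field name is an assumption based on the change above:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // status_code is assumed to be exposed on the page response.
            println!("{:?} - {:?}", page.get_url(), page.status_code);
        }
    });

    website.crawl().await;
}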

Full Changelog: v1.46.5...v1.48.0

v1.46.5

28 Oct 17:52

What's Changed

  • chore(page): fix subdomain entry point handling root by @j-mendez in #146

Full Changelog: v1.46.4...v1.46.5