Releases: spider-rs/spider

v1.80.15

19 Dec 20:15

What's Changed

  • feat(depth): add crawl depth level control
  • feat(redirect): expose redirect limit and respect server redirects
  • feat(redirect): add redirect policy Loose & Strict
  • perf(control): add rwlock crawl control

Example:

extern crate spider;

use spider::{tokio, website::Website, configuration::RedirectPolicy};
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_depth(3)
        .with_redirect_limit(4)
        .with_redirect_policy(RedirectPolicy::Strict)
        .build()
        .unwrap();

    website.crawl().await;

    let links = website.get_links();

    // iterate without consuming the set so the total can still be printed below.
    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {:?}", links.len());

    Ok(())
}

Full Changelog: v1.80.3...v1.80.15

v1.80.3

16 Dec 18:00

What's Changed

  • feat(cache): add caching backend feat flag by @j-mendez in #156
  • chore(chrome_intercept): fix intercept redirect initial domain
  • perf(chrome_intercept): improve intercept handling of assets

Example:

Make sure the feat flag [cache] is enabled. The cache can be stored in memory with the [cache_mem] flag instead of using disk space.
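
For reference, a minimal dependency entry with the cache flag might look like the following (swap in cache_mem for in-memory storage):

[dependencies]
spider = { version = "1.80.3", features = ["cache"] }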

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // we can use the builder method to enable caching or set `website.cache` to true directly.
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_caching(true)
        .build()
        .unwrap();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
    // the next run of website.crawl().await will be faster since the content is cached on disk.
}

Full Changelog: v1.70.4...v1.80.3

v1.71.5

15 Dec 00:28

What's Changed

Spider(Core)

Request interception can be done by enabling [chrome_intercept] and setting website.chrome_intercept. This blocks all resources that are not related to the domain, speeding up requests when using Chrome.

Ex:

//! `cargo run --example chrome --features chrome_intercept`
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let block_images = true;
    let mut website: Website = Website::new("https://rsseau.fr")
        .with_chrome_intercept(true, block_images)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}

CLI

Request interception can be done by passing the block_images arg and enabling the [chrome_intercept] feature flag.

Ex: --block_images
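
A full invocation might look like the following, assuming the crawl subcommand and the renamed url arg, with the binary built using the chrome_intercept feature:

spider --url https://rsseau.fr --block_images crawl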

Full Changelog: v1.60.12...v1.70.5

v1.60.13

12 Dec 16:42

What's Changed

This release brings a new feature flag (smart), performance improvements, and fixes.

  • feat(smart): add the smart feature flag for smart mode. Requests default to HTTP until JavaScript rendering is needed
  • perf(crawl): add clone external checking
  • chore(chrome): fix chrome connection socket keep alive on remote connections
  • feat(chrome_store_page): add feat flag chrome_store_page and screenshot helper
  • chore(decentralize): fix glob build
  • feat(redirect): add transparent top redirect handling

Smart Mode

Smart mode brings the best of both worlds when crawling. It runs plain HTTP requests first and only switches to Chrome when JavaScript rendering is required.
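
A minimal sketch, assuming the smart feature flag is compiled in and the crawl_smart helper it gates:

//! requires building with the `smart` feature flag
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // with the smart feature, pages are fetched over plain HTTP first and only
    // rendered with Chrome when JavaScript execution is required.
    website.crawl_smart().await;

    println!("Links found {:?}", website.get_links().len());
}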

Screenshots

Taking a screenshot manually can be done with the [chrome_store_page] feature flag.

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("Screenshotting: {:?}", page.get_url());
            let full_page = false;
            let omit_background = true;
            page.screenshot(full_page, omit_background).await;
            // output is stored by default to ./storage/ use the env variable SCREENSHOT_DIRECTORY to adjust the path.
        }
    });

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}

Full Changelog: v1.50.20...v1.60.13

v1.50.20

04 Dec 18:11

What's Changed

  • feat(chrome): add chrome_screenshot feature flag
  • chore(control): fix control task abort after crawl
  • chore(website): add website.stop handling shutdown

Full Changelog: v1.50.2...v1.50.20

v1.50.5

25 Nov 15:19

What's Changed

You can now run a cron job at any time to sync data from crawls. Use cron together with subscribe to handle data curation with ease.

  • feat(cron): add cron feature flag by @j-mendez in #153
  • chore(tls): add optional native tls
  • feat(napi): add napi support for nodejs

Example:

[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }

extern crate spider;

use spider::website::{Website, run_cron};
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron to run or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // take ownership of the website. You can also use website.run_cron, but then you need to abort the spawned handles manually.
    let runner = run_cron(website).await;
    
    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}

Full Changelog: v1.49.10...v1.50.5

v1.49.12

24 Nov 21:31

What's Changed

  • feat(cookies): add cookie jar optional feature

You can set a cookie String directly with website.cookie_str; it is added to each request. Using the cookie feature also enables storing cookies that are received.
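
A minimal sketch, assuming the cookie feature is enabled and that cookie_str accepts a plain String (the cookie value here is only a placeholder):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // sent with every request made during the crawl; the value is a placeholder.
    website.cookie_str = "sessionid=abc123".into();

    website.crawl().await;

    println!("Links found {:?}", website.get_links().len());
}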

Full Changelog: v1.49.10...v1.49.12

v1.49.10

20 Nov 17:45

What's Changed

  • chore(chrome): fix chrome headless headful args
  • chore(cli): add http check cli website url
  • chore(cli): rename domain arg to url [#150]
  • chore(cli): add invalid website error log
  • Return status code on error by @marlonbaeten in #151
  • chore(chrome): add main chromiumoxide crate - ( fork changes merged to the main repo )
  • chore(chrome): fix headful browser open
  • chore(website): add crawl_concurrent_raw method by @j-mendez in #152
  • chore(deps): bump tokio@1.34.0

Thank you @marlonbaeten for the help!

Full Changelog: v1.48.0...v1.49.10

v1.48.0

13 Nov 16:16

What's Changed

  • feat(page): add status code and error message page response by @j-mendez in #148 (see the sketch below)
  • chore(scraper): ignore scripts and styles when extracting text from nodes
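
A short sketch of reading the new response data from a subscription; the status_code field name is an assumption based on the change above:

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // status_code is assumed to be exposed on the page response.
            println!("{:?} - {:?}", page.get_url(), page.status_code);
        }
    });

    website.crawl().await;
}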

Full Changelog: v1.46.5...v1.48.0

v1.46.5

28 Oct 17:52

What's Changed

  • chore(page): fix subdomain entry point handling root by @j-mendez in #146

Full Changelog: v1.46.4...v1.46.5