Releases: spider-rs/spider
v1.93.13
What's Changed
Updated crate compatibility with reqwest@0.12.4 and fixed the headers flag compile for the worker.
Removed the http3 feature flag - follow the unstable instructions if needed.
The function website.get_domain was renamed to website.get_url.
The function website.get_domain_parsed was renamed to website.get_url_parsed.
- chore(worker): fix headers flag compile
- chore(crates): update async-openai@0.20.0
- chore(openai): trim start messages content output text
- chore(website): fix url getter function name
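A minimal sketch of the renamed getters; printing them via Debug is an assumption about their return types:
extern crate spider;
use spider::website::Website;

fn main() {
    let website: Website = Website::new("https://rsseau.fr");
    // These getters were previously named get_domain and get_domain_parsed.
    println!("url: {:?}", website.get_url());
    println!("parsed: {:?}", website.get_url_parsed());
}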
Full Changelog: v1.93.3...v1.93.13
v1.93.3
What's Changed
You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy with remote headless Chrome is now fixed.
- feat(openai): add screenshot js execution after effects
- feat(openai): add deserialization error determination
- chore(chrome): fix proxy server headless connecting
use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);
// Take a screenshot after each OpenAI-driven step.
gpt_config.screenshot = true;
// Return the extra AI data (prompt, JS output, and content output) with each page.
gpt_config.set_extra(true);
Full Changelog: v1.92.0...v1.93.3
v1.92.0
What's Changed
Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.
- docs: fix broken glob url link by @emilsivervik in #179
- feat(openai): add response caching
Example
extern crate spider;
use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;
#[tokio::main]
async fn main() {
let cache = Cache::builder()
.time_to_live(Duration::from_secs(30 * 60))
.time_to_idle(Duration::from_secs(5 * 60))
.max_capacity(10_000)
.build();
let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
"gpt-4-1106-preview",
vec![
"Search for Movies",
"Click on the first result movie result",
],
500,
Some(cache),
);
gpt_config.set_extra(true);
let mut website: Website = Website::new("https://www.google.com")
.with_chrome_intercept(true, true)
.with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
.with_limit(1)
.with_openai(Some(gpt_config))
.build()
.unwrap();
let mut rx2 = website.subscribe(16).unwrap();
let handle = tokio::spawn(async move {
while let Ok(page) = rx2.recv().await {
println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
}
});
let start = crate::tokio::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
let links = website.get_links();
println!(
"(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
links.len()
);
// crawl the page again to see if cache is re-used.
let start = crate::tokio::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
website.unsubscribe();
let _ = handle.await;
println!(
"(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
links.len()
);
}
New Contributors
- @emilsivervik made their first contribution in #179
Full Changelog: v.1.91.1...v1.92.0
v1.90.0
What's Changed
RSS feeds are now handled automatically on crawls.
- feat(rss): add rss support
- chore(openai): fix compile chrome flag
- chore(crate): remove serde pin
- chore(website): fix sitemap chrome build
- chore(crate): remove pins on common crates (reduces build size)
- chore(openai): fix prompt deserialization
- chore(openai): add custom api key config
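A minimal sketch of crawling a feed, assuming an RSS URL can be passed as the start URL like any other page (the feed path below is illustrative):
extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // RSS feeds encountered during the crawl are handled automatically,
    // so feed entries are picked up like regular links.
    let mut website: Website = Website::new("https://rsseau.fr/rss.xml");
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}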
Full Changelog: v1.89.0...v1.90.0
v.1.91.1
What's Changed
The AI results now return the input (prompt), js_output, and content_output.
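A short sketch of reading these results, reusing the subscribe pattern from the examples above; extra_ai_data is printed via Debug since the exact field layout isn't shown here:
extern crate spider;
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs =
        GPTConfigs::new("gpt-4-1106-preview", "Search for Movies", 500);
    // Enable the extra AI data so the results are attached to each page.
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // extra_ai_data now carries the input (prompt), js_output,
            // and content_output for each AI step.
            println!("{}\n{:?}", page.get_url(), page.extra_ai_data);
        }
    });

    website.crawl().await;
}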
Full Changelog: v1.90.0...v.1.91.1
v1.88.7
What's Changed
You can now drive the browser through multiple steps toward a goal; see the example below. Extracting content or gathering extra data can also be done using GPTConfigs.extra_ai_data. The credits used can be checked with Page.openai_credits_used.
- chore(page): return all page content regardless of status
- chore(openai): fix svg removal
- feat(openai): add extra data gpt curating
- chore(openai): add credits used response
- feat(fingerprint): add fingerprint id configuration
extern crate spider;
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
let gpt_config: GPTConfigs = GPTConfigs::new_multi(
"gpt-4-1106-preview",
vec![
"Search for Movies",
"Click on the first result movie result",
],
500,
);
let mut website: Website = Website::new("https://www.google.com")
.with_openai(Some(gpt_config))
.with_limit(1)
.build()
.unwrap();
website.crawl().await;
}
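Building on the example above, a sketch of checking the credits per page with the same subscribe pattern shown in the v1.92.0 notes; set this up before the crawl:
// Add before website.crawl().await in the example above.
let mut rx2 = website.subscribe(16).unwrap();
tokio::spawn(async move {
    while let Ok(page) = rx2.recv().await {
        // openai_credits_used reports the credits spent on each page.
        println!("{} used {:?} credits", page.get_url(), page.openai_credits_used);
    }
});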
Full Changelog: v1.87.3...v1.88.7
v1.87.3
What's Changed
You can now bypass Cloudflare-protected pages with the real_browser feature flag.
- feat(real_browser): add real_browser feature flag for chrome
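A minimal sketch of enabling the flag; the Cargo.toml line in the comment is an assumption about combining it with the chrome feature, and the crawl code itself is unchanged:
// Cargo.toml (sketch): spider = { version = "1.87", features = ["chrome", "real_browser"] }
extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // With real_browser enabled, Cloudflare-protected pages can be crawled.
    let mut website: Website = Website::new("https://rsseau.fr");
    website.crawl().await;
    println!("total pages: {}", website.get_links().len());
}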
Full Changelog: v1.86.16...v1.87.3
v1.86.16
What's Changed
You can now dynamically drive the browser with custom scripts using OpenAI.
Make sure to set the OPENAI_API_KEY environment variable or pass the key in to the program.
extern crate spider;
use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;
#[tokio::main]
async fn main() {
let _ = tokio::fs::create_dir_all("./storage/").await;
let screenshot_params =
spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
let screenshot_config =
spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);
let mut website: Website = Website::new("https://google.com")
.with_chrome_intercept(true, true)
.with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
.with_screenshot(Some(screenshot_config))
.with_limit(1)
.with_openai(Some(GPTConfigs::new(
"gpt-4-1106-preview",
"Search for Movies",
500,
)))
.build()
.unwrap();
let mut rx2 = website.subscribe(16).unwrap();
tokio::spawn(async move {
while let Ok(page) = rx2.recv().await {
println!("{}\n{}", page.get_url(), page.get_html());
}
});
website.crawl().await;
}
(Screenshots in the release show the output of the custom script from the AI and the resulting page.)
Full Changelog: v1.85.4...v1.86.16
v1.85.4
What's Changed
You can now update the crawl links outside of the crawl context by using website.queue to get a sender.
- feat(q): add mid crawl queue
- chore(chrome): fix semaphore limiting scrape
use spider::tokio;
use spider::url::Url;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website: Website = Website::new("https://rsseau.fr");
let mut rx2 = website.subscribe(16).unwrap();
let mut g = website.subscribe_guard().unwrap();
let q = website.queue(100).unwrap();
tokio::spawn(async move {
while let Ok(res) = rx2.recv().await {
let u = res.get_url();
println!("{:?}", u);
let mut url = Url::parse(u).expect("Failed to parse URL");
let mut segments: Vec<_> = url
.path_segments()
.map(|c| c.collect::<Vec<_>>())
.unwrap_or_else(Vec::new);
if segments.len() > 0 && segments[0] == "en" {
segments[0] = "fr";
let new_path = segments.join("/");
url.set_path(&new_path);
// get a new url here or perform an action and queue links
// pre-fetch all fr locales
let _ = q.send(url.into());
}
g.inc();
}
});
let start = std::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
println!(
"Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
website.get_links().len()
)
}
Thanks @oiwn
Full Changelog: v1.84.11...v1.85.4