
Releases: spider-rs/spider

v1.98.4

02 Jul 13:16

What's Changed

You can now set the maximum size in bytes for fetched resources by using the environment variable SPIDER_MAX_SIZE_BYTES.

Example for a 1 GB limit:

export SPIDER_MAX_SIZE_BYTES=1073741824
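
A minimal sketch of applying the same limit when embedding spider in Rust, assuming spider reads SPIDER_MAX_SIZE_BYTES from the process environment when the crawl runs; setting it with std::env::set_var before crawling mirrors the export above:

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    // Assumption: spider reads SPIDER_MAX_SIZE_BYTES from the environment at crawl time,
    // so this mirrors `export SPIDER_MAX_SIZE_BYTES=1073741824` (a 1 GB limit).
    std::env::set_var("SPIDER_MAX_SIZE_BYTES", "1073741824");

    let mut website: Website = Website::new("https://rsseau.fr/en/");
    website.crawl().await;

    println!("crawled {} pages", website.get_links().len());
}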

Full Changelog: v1.98.3...v1.98.4

v1.98.3

19 Jun 12:24

What's Changed

  • Fix sitemap regex compile.
  • Fix robots disallow respect.
  • Whitelisting routes so the crawl only visits the paths you want can now be done with website.with_whitelist_url, example:
use spider::{tokio, website::Website};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");

    // Only crawl links whose path matches "/books".
    website.with_whitelist_url(Some(vec!["/books".into()]));

    // Subscribe to receive each page as it is crawled.
    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
        website.subscribe(0).unwrap();
    let mut stdout = tokio::io::stdout();

    // Stream each crawled URL to stdout as it arrives.
    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let _ = stdout
                .write_all(format!("- {}\n", res.get_url()).as_bytes())
                .await;
        }
        stdout
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    website.unsubscribe();
    let duration = start.elapsed();
    let mut stdout = join_handle.await.unwrap();

    let _ = stdout
        .write_all(
            format!(
                "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                duration,
                website.get_links().len()
            )
            .as_bytes(),
        )
        .await;
}

Full Changelog: v1.97.14...v1.98.3

v1.97.14

13 Jun 15:01

What's Changed

Fix issue with invalid Chrome User-Agents when spoofing. If you are running spider as a job across many sites, use website.with_shared_queue to make the workload fair across all websites (see the sketch after the list below).

  • chore(chrome): fix non chrome agents spoofing
  • feat(sem): add shared queue strategy
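
A minimal sketch of crawling several sites from one process with the shared queue enabled; the bool argument to with_shared_queue is an assumption here, mirroring the other with_* toggles on the builder:

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut handles = Vec::new();

    for url in ["https://rsseau.fr/en/", "https://choosealicense.com"] {
        let mut website: Website = Website::new(url);
        // Assumption: with_shared_queue takes a bool toggle like the other with_* builders.
        website.with_shared_queue(true);

        handles.push(tokio::spawn(async move {
            website.crawl().await;
            (url, website.get_links().len())
        }));
    }

    for handle in handles {
        if let Ok((url, pages)) = handle.await {
            println!("{url}: {pages} pages crawled");
        }
    }
}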

Full Changelog: v1.97.12...v1.97.14

v1.97.12

04 Jun 18:44

What's Changed

  • add scoped website semaphore
  • add the cowboy feature flag to remove semaphore limiting 🤠 (see the Cargo.toml sketch after this list)
  • remove budget feature flag
  • fix accidental chrome_intercept type injection compile error
  • chore(cli): fix params builder optional handling
  • chore(page): add invalid url handling
  • chore(website): fix type blacklist compile
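
A minimal Cargo.toml sketch for opting into the cowboy flag; the version number is illustrative, and removing the semaphore limiting means concurrency is bounded only by your own configuration:

[dependencies]
# The cowboy feature removes spider's internal semaphore limiting; use with care.
spider = { version = "1.97", features = ["cowboy"] }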

Full Changelog: v1.95.25...v1.97.12

v1.96.0

04 Jun 14:25

What's Changed

Fix Chrome stealth handling of the user-agent.

  1. chore(website): fix chrome stealth handling agent
  2. chore(website): add safe semaphore handling

Full Changelog: v1.95.25...v1.96.0

v1.95.28

01 Jun 12:09

What's Changed

The website crawl status now returns the proper state without resetting.

  1. chore(website): fix crawl status persisting

Full Changelog: v1.95.25...v1.95.28

v1.95.27

28 May 15:33

What's Changed

This release provides a major fix for crawls being slowed down by robots.txt respect and crawl delays. If you set a limit or budget for the crawl and a robots.txt contains a delay of 10s, that delay would bottleneck the entire crawl once the limit applied, since each link had to finish processing before exiting. The robots delay is now capped at 60s for efficiency (a configuration sketch follows the list below).

  • chore(cli): fix limit respecting
  • chore(robots): fix respect robots [#184]
  • bump chromiumoxide@0.6.0
  • bump tiktoken-rs@0.5.9
  • bump hashbrown@0.14.5
  • add zstd support reqwest
  • unpin smallvec
  • chore(website): fix crawl limit immediate exit
  • chore(robots): add max delay respect
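
A minimal sketch of the kind of configuration affected by this fix, assuming with_respect_robots_txt and with_budget builder methods with the signatures shown; with a budget in place, any robots.txt crawl-delay is now honored only up to the 60s cap:

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");

    // Honor robots.txt rules; any crawl-delay found there is now capped at 60s.
    website.with_respect_robots_txt(true);
    // Assumption: spider re-exports hashbrown and with_budget takes a map of
    // path patterns to page limits ("*" applies to the whole site).
    website.with_budget(Some(spider::hashbrown::HashMap::from([("*", 50)])));

    website.crawl().await;

    println!("crawled {} pages", website.get_links().len());
}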

Full Changelog: v1.95.6...v1.95.27

v1.95.9

18 May 19:05

What's Changed

  1. chore(openai): fix smart mode passing target url
  2. chore(js): remove alpha js feature flag - jsdom crate
  3. chore(chrome): remove unnecessary page activation
  4. chore(openai): compress base prompt
  5. chore(openai): remove hidden content from request

Full Changelog: v1.94.4...v1.95.9

v1.94.4

09 May 16:44

What's Changed

Using a hybrid cache between Chrome CDP requests and HTTP requests can be done with the cache_chrome_hybrid feature flag.
You can simulate browser HTTP headers to increase the chance of a request succeeding over plain HTTP using the real_browser flag (see the Cargo.toml sketch after the list below).

  1. feat(cache): add chrome caching between http
  2. feat(real_browser): add http simulation headers
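
A minimal Cargo.toml sketch enabling both flags from the notes above; the version number is illustrative:

[dependencies]
# cache_chrome_hybrid shares cached responses between Chrome CDP and plain HTTP requests.
# real_browser adds simulated browser headers to plain HTTP requests.
spider = { version = "1.94", features = ["cache_chrome_hybrid", "real_browser"] }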

Full Changelog: v1.93.43...v1.94.4

v1.93.43

03 May 18:35

What's Changed

Generating random real user-agents can now be done using ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.

Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer, followed by website.with_user_agent (a sketch follows the list below).

  • feat(spoof): add referrer spoofing
  • feat(spoof): add real user-agent spoofing
  • feat(chrome): add dynamic chrome connections
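
A minimal sketch of pairing ua_generator with the builder, assuming ua_generator = "0.4" is added as a direct dependency; spoof_ua() picks a one-off random real user-agent, while the UserAgents type mentioned above is the stateful randomizer for rotating agents between crawls:

use spider::{tokio, website::Website};
// Assumption: ua_generator 0.4 is declared as a direct dependency of your crate.
use ua_generator::ua::spoof_ua;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr/en/");

    // Hand a random real user-agent to the crawler.
    website.with_user_agent(Some(spoof_ua()));

    website.crawl().await;

    println!("crawled {} pages", website.get_links().len());
}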

Full Changelog: v1.93.23...v1.93.43