chore(website): add crawl_concurrent_raw method #152

Merged 1 commit on Nov 21, 2023
8 changes: 4 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.49.6"
version = "1.49.7"
authors = ["madeindjs <contact@rousseau-alexandre.fr>", "j-mendez <jeff@a11ywatch.com>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
version = "1.49.6"
version = "1.49.7"
path = "../spider"
features = ["serde"]

2 changes: 1 addition & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.49.6"
version = "1.49.7"
authors = ["madeindjs <contact@rousseau-alexandre.fr>", "j-mendez <jeff@a11ywatch.com>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
21 changes: 15 additions & 6 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
spider = "1.49.6"
spider = "1.49.7"
```

And then the code:
@@ -87,7 +87,7 @@ We have a couple optional feature flags. Regex blacklisting, jemalloc backend, gl

```toml
[dependencies]
spider = { version = "1.49.6", features = ["regex", "ua_generator"] }
spider = { version = "1.49.7", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -116,7 +116,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
spider = { version = "1.49.6", features = ["decentralized"] }
spider = { version = "1.49.7", features = ["decentralized"] }
```

```sh
@@ -136,7 +136,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
spider = { version = "1.49.6", features = ["sync"] }
spider = { version = "1.49.7", features = ["sync"] }
```

```rust,no_run
@@ -166,7 +166,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
spider = { version = "1.49.6", features = ["regex"] }
spider = { version = "1.49.7", features = ["regex"] }
```

```rust,no_run
@@ -193,7 +193,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
spider = { version = "1.49.6", features = ["control"] }
spider = { version = "1.49.7", features = ["control"] }
```

```rust
@@ -257,6 +257,15 @@ async fn main() {
}
```

+### Chrome
+
+```toml
+[dependencies]
+spider = { version = "1.49.7", features = ["chrome"] }
+```
+
+You can use `website.crawl_concurrent_raw` to perform a crawl without Chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage for debugging.
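For illustration, a minimal sketch of how that might be called, assuming `crawl_concurrent_raw` is public on `Website` and takes no arguments like `crawl` (the diff only names the method, so the call shape is an assumption):

```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumed call shape: crawl over plain HTTP requests, skipping the
    // Chromium browser even when the `chrome` feature is compiled in.
    website.crawl_concurrent_raw().await;

    // Print every link discovered during the crawl.
    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```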

### Blocking

If you need a blocking sync implementation, use a version prior to `v1.12.0`.
15 changes: 7 additions & 8 deletions spider/src/page.rs
@@ -197,14 +197,6 @@ pub fn build(_: &str, res: PageResponse) -> Page {
}

impl Page {
#[cfg(all(not(feature = "decentralized"), feature = "chrome"))]
/// Instantiate a new page and gather the html.
pub async fn new(url: &str, client: &Client, page: &chromiumoxide::Page) -> Self {
let page_resource = crate::utils::fetch_page_html(&url, &client, &page).await;
build(url, page_resource)
}

#[cfg(not(feature = "decentralized"))]
/// Instantiate a new page and gather the html repro of standard fetch_page_html.
pub async fn new_page(url: &str, client: &Client) -> Self {
let page_resource = crate::utils::fetch_page_html_raw(&url, &client).await;
@@ -218,6 +210,13 @@ impl Page {
        build(url, page_resource)
    }

#[cfg(all(not(feature = "decentralized"), feature = "chrome"))]
/// Instantiate a new page and gather the html.
pub async fn new(url: &str, client: &Client, page: &chromiumoxide::Page) -> Self {
let page_resource = crate::utils::fetch_page_html(&url, &client, &page).await;
build(url, page_resource)
}

    /// Instantiate a new page and gather the links.
    #[cfg(feature = "decentralized")]
    pub async fn new(url: &str, client: &Client) -> Self {