feat(cron): add cron feature flag (#153)

j-mendez committed Nov 25, 2023
1 parent 16c796a commit 2996280
Showing 10 changed files with 564 additions and 20 deletions.
22 changes: 18 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.49.13"
version = "1.50.0"
authors = ["madeindjs <contact@rousseau-alexandre.fr>", "j-mendez <jeff@a11ywatch.com>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
version = "1.49.13"
version = "1.50.0"
path = "../spider"
features = ["serde"]

10 changes: 7 additions & 3 deletions spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.49.13"
version = "1.50.0"
authors = ["madeindjs <contact@rousseau-alexandre.fr>", "j-mendez <jeff@a11ywatch.com>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -43,12 +43,15 @@ case_insensitive_string = { version = "0.1.7", features = [ "compact", "serde" ]
jsdom = { version = "0.0.11-alpha.1", optional = true, features = [ "hashbrown", "tokio" ] }
chromiumoxide = { version = "0.5.6", optional = true, features = ["tokio-runtime", "bytes"], default-features = false }
sitemap = { version = "0.4.1", optional = true }
chrono = "0.4.31"
cron = "0.12.0"
async-trait = "0.1.74"

[target.'cfg(all(not(windows), not(target_os = "android"), not(target_env = "musl")))'.dependencies]
tikv-jemallocator = { version = "0.5.0", optional = true }

[features]
default = ["sync"]
default = ["sync", "cron"]
regex = ["dep:regex"]
glob = ["dep:regex", "dep:itertools"]
ua_generator = ["dep:ua_generator"]
@@ -70,4 +73,5 @@ chrome = ["dep:chromiumoxide"]
chrome_headed = ["chrome"]
chrome_cpu = ["chrome"]
chrome_stealth = ["chrome"]
cookies = ["reqwest/cookies"]
cookies = ["reqwest/cookies"]
cron = []
52 changes: 45 additions & 7 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
spider = "1.49.13"
spider = "1.50.0"
```

And then the code:
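The example code itself is collapsed in this diff view; a minimal sketch of the basic crawl, assuming spider's documented `Website` API:

```rust
extern crate spider;
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // crawl the site, then iterate the links discovered along the way
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl().await;
    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```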
@@ -87,7 +87,7 @@ We have a couple optional feature flags. Regex blacklisting, jemaloc backend, gl

```toml
[dependencies]
spider = { version = "1.49.13", features = ["regex", "ua_generator"] }
spider = { version = "1.50.0", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -117,7 +117,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
spider = { version = "1.49.13", features = ["decentralized"] }
spider = { version = "1.50.0", features = ["decentralized"] }
```

```sh
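# (collapsed in this diff view) a sketch of the worker setup — the binary
# name and env vars are assumptions based on the spider_worker crate:
cargo install spider_worker
# start the worker (set the worker on another machine in production)
spider_worker
# point the crawler at the worker when running your project
SPIDER_WORKER=http://127.0.0.1:3030 cargo run --example example
```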
@@ -137,7 +137,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
spider = { version = "1.49.13", features = ["sync"] }
spider = { version = "1.50.0", features = ["sync"] }
```

```rust,no_run
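// (collapsed in this diff view) a sketch of the subscribe flow, reusing the
// channel API that appears in the cron example further down:
extern crate spider;
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // 16 is the broadcast channel capacity
    let mut rx = website.subscribe(16).unwrap();
    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{:?}", page.get_url());
        }
    });
    website.crawl().await;
}
```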
@@ -167,7 +167,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
spider = { version = "1.49.13", features = ["regex"] }
spider = { version = "1.50.0", features = ["regex"] }
```

```rust,no_run
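// (collapsed in this diff view) a sketch of blacklisting with the `regex`
// feature — the configuration field is an assumption about this version:
extern crate spider;
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // any URL matching the pattern is skipped during the crawl
    website
        .configuration
        .blacklist_url
        .insert(Default::default())
        .push("/licenses/".into());
    website.crawl().await;
}
```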
@@ -194,7 +194,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
spider = { version = "1.49.13", features = ["control"] }
spider = { version = "1.50.0", features = ["control"] }
```

```rust
@@ -258,11 +258,49 @@ async fn main() {
}
```

### Cron Jobs

Use cron jobs to run crawls continuously at any time.

```toml
[dependencies]
spider = { version = "1.50.0", features = ["sync", "cron"] }
```

```rust,no_run
extern crate spider;
use spider::website::{Website, run_cron};
use spider::tokio;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // set the cron schedule directly, or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // `run_cron` takes ownership of the website. You can also use `website.run_cron`,
    // but then you need to abort the handles it creates manually.
    let runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
```
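The schedule string follows the `cron` crate's expression format, which takes a leading seconds field, so `1/5 * * * * *` above fires every five seconds.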

### Chrome

```toml
[dependencies]
spider = { version = "1.49.13", features = ["chrome"] }
spider = { version = "1.50.0", features = ["chrome"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed to debug.
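A minimal sketch of a chromium-backed crawl under this flag; the commented-out raw variant is the method named above, with an assumed zero-argument signature:

```rust
extern crate spider;
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // with the `chrome` feature enabled, this crawl renders pages through chromium
    website.crawl().await;
    // to skip chromium for a run, the raw variant could be used instead:
    // website.crawl_concurrent_raw().await;
    println!("links: {}", website.get_links().len());
}
```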