feat(crawl): add subdomain and tld crawling #59

Merged
j-mendez merged 2 commits into master from feat/crawl-subdomains on Jun 24, 2022

j-mendez (Member) commented Jun 24, 2022

  • add subdomain crawling ability
  • add tld crawling ability

Together, these options allow gathering all pages related to a website's bare host name, across any TLD extension or subdomain, without sacrificing crawl speed.

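use spider::website::Website;

fn main() {
  let mut website: Website = Website::new("https://a11ywatch.com");
  // Also follow links on subdomains of the starting host.
  website.configuration.subdomains = true;
  // Also follow links on the same bare host name under other TLDs.
  website.configuration.tld = true;
  website.crawl();
}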

--
Example output for validation, since the current test cases and examples do not use subdomains.

Before: 25 links for a11ywatch.com on crawl.
[Screenshot: before, 25 links on the domain a11ywatch.com]

After: 50+ links for a11ywatch.com on crawl.
[Screenshot: after, 50+ links on the domain a11ywatch.com]

--

This PR combines two features into one: subdomain crawling and TLD ignoring. It might make sense to move TLD handling into a separate variable and option, since anyone can own a TLD variant that is not attached to the exact hostname. For example, you can own myspace.com while someone else holds the domain myspace.net. You can combine this with the blacklist url to ignore certain TLD extensions.
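To make the distinction between the two options concrete, here is a minimal sketch of how a host check could treat them independently. It is not the code from this PR: `split_host` and `host_allowed` are hypothetical helpers written for illustration, they use only the standard library, and they naively treat the last dot-separated label as the TLD (so multi-part suffixes such as co.uk are not handled).

```rust
/// Hypothetical helper (not part of spider): splits a host into
/// (subdomain labels, bare name, tld). Naive: the last label is the TLD.
fn split_host(host: &str) -> Option<(&str, &str, &str)> {
    // "docs.a11ywatch.com" -> rest = "docs.a11ywatch", tld = "com"
    let (rest, tld) = host.rsplit_once('.')?;
    // "docs.a11ywatch" -> sub = "docs", name = "a11ywatch"
    let (sub, name) = match rest.rsplit_once('.') {
        Some((sub, name)) => (sub, name),
        None => ("", rest),
    };
    Some((sub, name, tld))
}

/// Hypothetical helper (not part of spider): decides whether a link on
/// `candidate` should be followed when the crawl started on `base`.
/// `subdomains` and `tld` mirror the two flags described in this PR.
fn host_allowed(base: &str, candidate: &str, subdomains: bool, tld: bool) -> bool {
    let base = base.to_ascii_lowercase();
    let candidate = candidate.to_ascii_lowercase();

    // The exact same host is always allowed.
    if base == candidate {
        return true;
    }

    let (b_sub, b_name, b_tld) = match split_host(&base) {
        Some(parts) => parts,
        None => return false,
    };
    let (c_sub, c_name, c_tld) = match split_host(&candidate) {
        Some(parts) => parts,
        None => return false,
    };

    // The bare host name must always match.
    if b_name != c_name {
        return false;
    }

    // Each flag relaxes exactly one part of the comparison.
    let sub_ok = subdomains || b_sub == c_sub;
    let tld_ok = tld || b_tld == c_tld;
    sub_ok && tld_ok
}
```

Under these assumptions, `host_allowed("myspace.com", "myspace.net", false, true)` accepts the other TLD, which is the ownership concern raised above, while leaving `tld` off and enabling only `subdomains` keeps the crawl on hosts under the original TLD.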

@j-mendez j-mendez force-pushed the feat/crawl-subdomains branch 2 times, most recently from 8e79647 to 017baa1 Compare June 24, 2022 17:09
@j-mendez j-mendez force-pushed the feat/crawl-subdomains branch from 017baa1 to a72d1e7 Compare June 24, 2022 17:09
@j-mendez j-mendez changed the title feat(crawl): add subdomain crawling with tld ignore feat(crawl): add subdomain and tld crawling Jun 24, 2022
@j-mendez j-mendez requested a review from madeindjs June 24, 2022 17:44
@j-mendez j-mendez force-pushed the feat/crawl-subdomains branch from 23251a2 to 9664a0a Compare June 24, 2022 18:15
@j-mendez j-mendez force-pushed the feat/crawl-subdomains branch from 9664a0a to 6dec196 Compare June 24, 2022 18:17
@j-mendez j-mendez merged commit 760b3ce into master Jun 24, 2022
@j-mendez j-mendez deleted the feat/crawl-subdomains branch June 24, 2022 18:33