This repository contains an easy-to-use web crawler terminal program written in C#. Given an entry URL, the program visits links and saves their URLs, page titles, and meta descriptions. Optionally, you can export the crawled links to a CSV file.
Before you begin, ensure you have the .NET SDK installed on your machine. You can download it from the official .NET website.
After cloning the repo, go into the `Web Crawler` directory in a terminal and run:
dotnet build
dotnet run <entry URL> <number of pages to crawl> [--csv] [--cd:<number in milliseconds>]
- `--csv` exports the crawled links to a CSV file.
- `--cd:<number in milliseconds>` controls the crawl delay in milliseconds. For example, `--cd:2000` specifies a crawl delay of two seconds. If this argument isn't used, the crawl delay is 1000 ms.
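The argument handling described above could be sketched roughly as follows. This is an illustrative sketch, not the repository's actual code: the `CrawlOptions` record and `ArgParser` class are hypothetical names, and accepting the optional flags in any order is an assumption. Only the flag syntax and the 1000 ms default come from this README.

```csharp
using System;

// Hypothetical container for the parsed options; not the repository's actual type.
public sealed record CrawlOptions(string EntryUrl, int MaxPages, bool ExportCsv, int CrawlDelayMs);

public static class ArgParser
{
    // Default crawl delay stated in this README.
    private const int DefaultCrawlDelayMs = 1000;

    public static CrawlOptions Parse(string[] args)
    {
        if (args.Length < 2)
            throw new ArgumentException(
                "Usage: <entry URL> <number of pages to crawl> [--csv] [--cd:<number in milliseconds>]");

        string entryUrl = args[0];
        if (!int.TryParse(args[1], out int maxPages) || maxPages <= 0)
            throw new ArgumentException("The number of pages to crawl must be a positive integer.");

        bool exportCsv = false;
        int delayMs = DefaultCrawlDelayMs;

        // Assumption: optional flags may appear in any order after the two positional arguments.
        for (int i = 2; i < args.Length; i++)
        {
            string arg = args[i];
            if (arg == "--csv")
            {
                exportCsv = true;
            }
            else if (arg.StartsWith("--cd:", StringComparison.Ordinal))
            {
                // Everything after the "--cd:" prefix is the delay in milliseconds.
                string value = arg.Substring("--cd:".Length);
                if (!int.TryParse(value, out delayMs) || delayMs < 0)
                    throw new ArgumentException($"Invalid crawl delay: '{value}'.");
            }
            else
            {
                throw new ArgumentException($"Unknown argument: '{arg}'.");
            }
        }

        return new CrawlOptions(entryUrl, maxPages, exportCsv, delayMs);
    }
}
```

With this sketch, `ArgParser.Parse(new[] { "https://example.com", "10", "--cd:2000" })` would yield a crawl delay of 2000 ms with CSV export disabled, while omitting `--cd:` leaves the delay at the default.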
Without the `--csv` argument, the crawler only prints each crawled page's URL and title to the terminal. When you export to a CSV file, nothing is printed to the terminal.
If a title or description is not found on a webpage, the value will be displayed as `none`.
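The `none` fallback could be sketched as below. The `PageMetadata` class and its methods are hypothetical, and the regular expression is a deliberately crude stand-in for whatever HTML parsing the crawler actually performs; it is only meant to show the fallback behaviour.

```csharp
#nullable enable
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the crawler's real extraction logic may differ.
public static class PageMetadata
{
    // Placeholder used when a value is missing, as described in this README.
    private const string Missing = "none";

    // Crude pattern for the contents of the first <title> element.
    private static readonly Regex TitleRegex = new Regex(
        @"<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string ExtractTitle(string html)
    {
        Match match = TitleRegex.Match(html);
        // Decode HTML entities such as &amp; before applying the fallback.
        return OrNone(match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value) : null);
    }

    // Returns the trimmed value, or "none" when it is null, empty, or whitespace.
    public static string OrNone(string? value) =>
        string.IsNullOrWhiteSpace(value) ? Missing : value.Trim();
}
```

Routing every extracted value through a single `OrNone` helper keeps the placeholder consistent between the terminal output and the CSV export.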
Below is an example CSV of crawled links with `https://google.com` as the entry URL:
url | title | description |
---|---|---|
https://google.com | none | |
https://www.google.com/intl/en/about/products?tab=wh | Browse All of Google's Products & Services - Google | Browse a list of Google products designed to help you work and play, stay organized, get answers, keep in touch, grow your business, and more. |
https://google.com/advanced_search?hl=en&authuser=0 | Google Advanced Search | none |
https://google.com/intl/en/ads/ | Google Ads - Get Customers and Sell More with Online Advertising | Discover how online advertising with Google Ads can help grow your business. Get customers and sell more with our digital advertising platform. |
https://google.com/services/ | none | none |
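
Several descriptions in the table above contain commas, so a CSV writer has to quote such fields to keep the three columns intact. How the crawler itself handles this is not described here; the following is a minimal sketch of RFC 4180 style quoting, using a hypothetical `Csv` helper class.

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating RFC 4180 style field quoting.
public static class Csv
{
    // Wraps a field in double quotes when it contains a comma, a double quote,
    // or a line break, and doubles any embedded double quotes.
    public static string EscapeField(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the escaped fields into a single CSV row (without a trailing newline).
    public static string ToRow(params string[] fields) =>
        string.Join(",", fields.Select(EscapeField));
}
```

For instance, `Csv.ToRow("https://google.com/services/", "none", "none")` leaves the row unquoted, while a description containing a comma is wrapped in double quotes.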