AyushAjay14/Web-Crawler

WEB CRAWLER

This project crawls a given URL or domain and collects the desired information from the website.

TECH STACK :

  1. Python
  2. Web libraries and modules: BeautifulSoup, Selenium, Requests, re (regex), shutil, argparse, random, webdriver_manager (Chrome), Colorama.

DETAILS :

  • The main script is complete_web_crawler.py
  • The remaining scripts are parts of the program, showing the individual functions used by the main script.
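The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from complete_web_crawler.py; the function name, the e-mail regex, and the returned keys are assumptions.

```python
import re
from bs4 import BeautifulSoup

# Illustrative e-mail pattern (an assumption, not the script's own regex).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_info(html):
    """Hypothetical helper: pull links, e-mails, and image links from one page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "imagelinks": [img["src"] for img in soup.find_all("img", src=True)],
    }

# Example on an inline page, so no network access is needed:
page = '<a href="/about">About</a> <img src="/logo.png"> contact: admin@example.com'
info = extract_info(page)
```

In the real crawler the links collected here would be fed back into the fetch loop until the requested depth is reached.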

INSTALLING LIBRARIES :

  1. beautifulsoup4
  2. selenium
  3. requests
  4. webdriver-manager
  5. colorama

pip install beautifulsoup4 selenium requests webdriver-manager colorama

(argparse, re, random, and shutil ship with Python and need no installation.)

USAGE :

cd Web-Crawler

python complete_web_crawler.py --url <url> --depth <n> --emails <0|1> --headers <0|1> --phoneno <0|1> --imagelinks <0|1>

  • --url <the URL to crawl>
  • --depth <the required crawl depth>
  • --emails <1 to also scrape e-mail addresses, 0 to skip e-mail scraping>
  • --headers <1 to also scrape headers, 0 to skip>
  • --phoneno <1 to also scrape phone numbers, 0 to skip>
  • --imagelinks <1 to also scrape image links, 0 to skip>
  • Any option that is not provided defaults to 1.
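The flag handling above can be reproduced with argparse roughly like this. This is a sketch of how such a CLI could be defined, with every toggle defaulting to 1 as described; the real script may declare its arguments differently.

```python
import argparse

def build_parser():
    """Hypothetical reconstruction of the crawler's command line."""
    p = argparse.ArgumentParser(description="Simple web crawler")
    p.add_argument("--url", required=True, help="the URL to crawl")
    p.add_argument("--depth", type=int, default=1, help="the required crawl depth")
    # Each scraping toggle accepts 0 or 1 and defaults to 1 (enabled).
    for flag in ("emails", "headers", "phoneno", "imagelinks"):
        p.add_argument(f"--{flag}", type=int, choices=(0, 1), default=1,
                       help=f"1 to also scrape {flag}, 0 to skip")
    return p

# Omitted flags fall back to 1, matching the behaviour described above:
args = build_parser().parse_args(["--url", "https://example.com", "--imagelinks", "0"])
```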

EXAMPLE :

  • python complete_web_crawler.py --url https://ctftime.org --depth 1 --headers 1 --phoneno 1 --imagelinks 0

SCREENSHOT :

  • *** In order to make the screenshot function work properly, you need to install Google Chrome in your virtual machine.
  • Commands are:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb
Ensure it worked:
google-chrome --version

ERRORS :

  • If you are using WSL2 and get a ChromeDriver-related error, follow the methods given here - https://www.gregbrisebois.com/posts/chromedriver-in-wsl2/

