Web Crawler with Node.js

Welcome to my web crawler project! This project involved setting up a Node.js environment, normalizing URLs, extracting URLs from HTML, and recursively crawling websites to gather data. Additionally, I integrated Jest for test-driven development, providing a solid foundation for reliable and maintainable code.

Table of Contents

  • Introduction
  • Features
  • Technologies Used
  • Installation
  • Usage
  • Lessons Learned

Introduction

The crawler starts from a given URL, normalizes and extracts the URLs found in each page's HTML, recursively follows them to crawl the site, and generates a report on the pages it visited.

Features

  • Normalize URLs: Ensures equivalent URLs compare equal (see the sketch after this list).
  • Extract URLs from HTML: Parses HTML content to find and extract links (also sketched below).
  • Recursive Crawling: Crawls web pages recursively to gather data.
  • Report Generation: Generates a report on the status of the crawled pages.
  • Test-Driven Development: Uses Jest for testing to ensure code reliability.
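
A minimal sketch of the first two helpers, assuming the function names above and the jsdom package for HTML parsing (the repository's actual implementation may differ):

    const { JSDOM } = require('jsdom')

    // Strip the protocol and any trailing slash so equivalent URLs
    // compare equal: "https://example.com/path/" and
    // "http://example.com/path" both become "example.com/path".
    function normalizeURL(urlString) {
      const url = new URL(urlString)
      const hostPath = `${url.hostname}${url.pathname}`
      return hostPath.endsWith('/') ? hostPath.slice(0, -1) : hostPath
    }

    // Collect the href of every <a> tag, resolving relative links
    // against the page's base URL.
    function getURLsFromHTML(htmlBody, baseURL) {
      const dom = new JSDOM(htmlBody)
      const urls = []
      for (const anchor of dom.window.document.querySelectorAll('a')) {
        const href = anchor.getAttribute('href')
        if (!href) continue
        try {
          urls.push(new URL(href, baseURL).href)
        } catch (err) {
          console.log(`skipping invalid href: ${href}`)
        }
      }
      return urls
    }

    module.exports = { normalizeURL, getURLsFromHTML }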

Technologies Used

  • Node.js: Backend runtime environment.
  • JavaScript: Programming language for the crawler logic.
  • Fetch API: Makes the HTTP requests to each page (see the crawl sketch after this list).
  • Jest: Testing framework for JavaScript.
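
The crawl itself is driven by the Fetch API. Below is a minimal sketch of a recursive crawl loop, assuming Node 18+ (which ships a global fetch), the helpers sketched under Features, and a pages object that maps each normalized URL to the number of times it was linked; this illustrates the approach rather than reproducing the repository's exact code:

    async function crawlPage(baseURL, currentURL, pages) {
      // Skip links that leave the site being crawled.
      if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
        return pages
      }
      // Count repeat links, but fetch each page only once.
      const normalized = normalizeURL(currentURL)
      if (pages[normalized] > 0) {
        pages[normalized]++
        return pages
      }
      pages[normalized] = 1

      console.log(`crawling ${currentURL}`)
      try {
        const resp = await fetch(currentURL)
        if (resp.status >= 400) {
          console.log(`HTTP ${resp.status} on ${currentURL}`)
          return pages
        }
        const contentType = resp.headers.get('content-type')
        if (!contentType || !contentType.includes('text/html')) {
          return pages // skip images, PDFs, and other non-HTML responses
        }
        const html = await resp.text()
        for (const nextURL of getURLsFromHTML(html, baseURL)) {
          pages = await crawlPage(baseURL, nextURL, pages)
        }
      } catch (err) {
        console.log(`fetch failed for ${currentURL}: ${err.message}`)
      }
      return pages
    }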

Installation

To run this project locally, follow these steps:

  1. Clone the repository:

    git clone https://github.com/WhisperNet/webCrawler-node.js.git

  2. Navigate to the project directory:

    cd webCrawler-node.js

  3. Install dependencies:

    npm install

Usage

  1. Run the crawler (it prints a report when the crawl finishes; see the sketch below):

    npm run start https://example.com

  2. Run tests:

    npm run test
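
When the crawl finishes, the crawler prints a report of what it found. A minimal sketch of such a report step, assuming the { normalizedURL: count } map built by the crawl sketch above:

    // Print pages sorted by inbound link count, most-linked first.
    function printReport(pages) {
      console.log('===== REPORT =====')
      const sorted = Object.entries(pages).sort((a, b) => b[1] - a[1])
      for (const [url, count] of sorted) {
        console.log(`Found ${count} internal links to ${url}`)
      }
    }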

Lessons Learned

Throughout this project, I gained valuable insights into:

  • URL Normalization: Ensuring consistency and correctness in URL formats.
  • HTML Parsing: Extracting useful information from HTML content.
  • Recursive Algorithms: Implementing recursive logic for web crawling.
  • Test-Driven Development: Writing tests with Jest to ensure code reliability and maintainability.
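
As an illustration of that workflow, here is an example Jest test for normalizeURL; the file path and test cases are illustrative, not copied from the repository:

    // crawl.test.js (hypothetical path) -- run with `npm run test`
    const { normalizeURL } = require('./crawl')

    test('normalizeURL strips protocol and trailing slash', () => {
      expect(normalizeURL('https://example.com/path/')).toBe('example.com/path')
      expect(normalizeURL('http://example.com/path')).toBe('example.com/path')
    })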
