Welcome to my web crawler project! This project involved setting up a Node.js environment, normalizing URLs, extracting URLs from HTML, and recursively crawling websites to gather data. Additionally, I integrated Jest for test-driven development, providing a solid foundation for reliable and maintainable code.
Key features, several of which are sketched in code below:
- Normalize URLs: Ensures consistency in URL format.
- Extract URLs from HTML: Parses HTML content to find and extract URLs.
- Recursive Crawling: Crawls web pages recursively to gather data.
- Generating a Report: Generates a report on the status of the crawled pages.
- Test-Driven Development: Uses Jest for testing to ensure code reliability.
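To make the feature list concrete, here is a minimal sketch of what the URL normalization step can look like. The function name `normalizeURL` and the exact rules (dropping the protocol and any trailing slash) are illustrative assumptions, not necessarily what this repo implements:

```js
// Sketch: normalize a URL so that variants like
// https://EXAMPLE.com/path/ and http://example.com/path
// map to the same key. Uses Node's built-in WHATWG URL class.
function normalizeURL(urlString) {
  const urlObj = new URL(urlString); // hostname is lowercased automatically
  let fullPath = `${urlObj.hostname}${urlObj.pathname}`;
  // Treat /path and /path/ as the same page
  if (fullPath.endsWith('/')) {
    fullPath = fullPath.slice(0, -1);
  }
  return fullPath;
}

// normalizeURL('https://example.com/path/') -> 'example.com/path'
```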
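URL extraction can be sketched with an HTML parser such as jsdom; whether this project actually depends on jsdom, and the helper name `getURLsFromHTML`, are assumptions here:

```js
const { JSDOM } = require('jsdom'); // assumed dependency for parsing HTML

// Sketch: collect the href of every <a> tag, resolving relative
// links (e.g. /about) against the page's base URL.
function getURLsFromHTML(htmlBody, baseURL) {
  const urls = [];
  const dom = new JSDOM(htmlBody);
  for (const anchor of dom.window.document.querySelectorAll('a')) {
    const href = anchor.getAttribute('href');
    if (!href) continue;
    try {
      urls.push(new URL(href, baseURL).href);
    } catch {
      // Skip hrefs that aren't valid URLs
    }
  }
  return urls;
}
```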
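The recursive crawl and the report can then be built on top of those two helpers. This sketch assumes Node 18+ (for the built-in `fetch`) and a `pages` object that counts how many times each normalized URL is seen; the actual crawler may differ:

```js
// Sketch: recursively crawl pages on the same host as baseURL,
// counting internal links per normalized URL. Reuses the
// normalizeURL and getURLsFromHTML helpers sketched above.
async function crawlPage(baseURL, currentURL, pages = {}) {
  // Stay on the starting site
  if (new URL(baseURL).hostname !== new URL(currentURL).hostname) {
    return pages;
  }
  const key = normalizeURL(currentURL);
  if (pages[key] !== undefined) {
    pages[key]++; // already visited: just bump the count
    return pages;
  }
  pages[key] = 1;
  try {
    const resp = await fetch(currentURL); // built into Node 18+
    const contentType = resp.headers.get('content-type') || '';
    if (resp.status >= 400 || !contentType.includes('text/html')) {
      return pages; // skip errors and non-HTML responses
    }
    const html = await resp.text();
    for (const nextURL of getURLsFromHTML(html, baseURL)) {
      pages = await crawlPage(baseURL, nextURL, pages);
    }
  } catch (err) {
    console.log(`failed to fetch ${currentURL}: ${err.message}`);
  }
  return pages;
}

// Sketch: print pages sorted by how often they were linked to
function printReport(pages) {
  console.log('=== crawl report ===');
  for (const [url, count] of Object.entries(pages).sort((a, b) => b[1] - a[1])) {
    console.log(`Found ${count} internal links to ${url}`);
  }
}
```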
Tech stack:
- Node.js: Backend runtime environment.
- JavaScript: Programming language for writing the crawler logic.
- Fetch API: For making HTTP requests to web pages.
- Jest: Testing framework for JavaScript.
To run this project locally, follow these steps:
- Clone the repository:
  ```bash
  git clone https://github.com/WhisperNet/webCrawler-node.js.git
  ```
- Navigate to the project directory:
  ```bash
  cd webCrawler-node.js
  ```
- Install dependencies:
  ```bash
  npm install
  ```
- Run the crawler:
  ```bash
  npm run start https://example.com
  ```
- Run tests:
  ```bash
  npm run test
  ```
Throughout this project, I gained valuable insights into:
- URL Normalization: Ensuring consistency and correctness in URL formats.
- HTML Parsing: Extracting useful information from HTML content.
- Recursive Algorithms: Implementing recursive logic for web crawling.
- Test-Driven Development: Writing tests with Jest to ensure code reliability and maintainability (a sample test is sketched below).
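As an example of the Jest side, a unit test for the normalization step might look like the following; the module path `./crawl.js` and the expected outputs are assumptions based on the `normalizeURL` sketch above:

```js
const { normalizeURL } = require('./crawl.js'); // assumed module path

// Jest provides test() and expect() as globals in *.test.js files
test('normalizeURL strips the protocol and trailing slash', () => {
  expect(normalizeURL('https://example.com/path/')).toBe('example.com/path');
  expect(normalizeURL('http://EXAMPLE.com/path')).toBe('example.com/path');
});
```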