Date | URLs Crawled per Second (Total) | URLs Crawled per Second (Successful) | Crawl Success Rate (%) | Commit Hash | Notes |
---|---|---|---|---|---|
2024-08-14 | 0.27 | 0.16 | 56% | 267531120494ea6d4ecb291b66ec7fc561361e09 | Basic async implementation |
To install dependencies, run:

`poetry install`

This will install all the dependencies defined in your `pyproject.toml` file.
To activate the project's virtual environment, run:

`poetry shell`

This will spawn a shell with the virtual environment activated, allowing you to run Python scripts and commands inside it.
To run the test suite with pytest, use:

`poetry run pytest`

This will execute all tests found in the `tests/` directory.
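For reference, a minimal test that pytest would collect from `tests/` could look like the following; the file name, the `normalize_url` helper, and the asserted behavior are hypothetical and not taken from this project.

```python
# tests/test_urls.py (hypothetical example)

def normalize_url(url: str) -> str:
    # Stand-in for a project helper, included only so the test is self-contained.
    return url.rstrip("/").lower()


def test_normalize_url_strips_trailing_slash_and_lowercases():
    assert normalize_url("https://Example.com/") == "https://example.com"
```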
To format your code automatically with Black, run:

`poetry run black .`

Black will reformat your files in place to adhere to its style guide.
To lint your code with Ruff, run:

`poetry run ruff .`

Ruff will analyze your code for potential errors and style issues.
To profile the scraper:

- Set `PROFILE=1` in `run_scraper.sh`
- Run: `snakeviz output.prof`
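The profiling hook itself lives in the scraper code, which isn't shown here; as a rough sketch, the `PROFILE=1` flag might be consumed like this (the `main()` entry point and the exact wiring are assumptions):

```python
import cProfile
import os


def main() -> None:
    # Placeholder for the scraper's real entry point.
    ...


if __name__ == "__main__":
    if os.environ.get("PROFILE") == "1":
        # Dump stats to output.prof so they can be inspected with snakeviz.
        cProfile.run("main()", "output.prof")
    else:
        main()
```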
General TODOs:

- A scalable version of this consumer/scraper should run on k8s with multiple pods
- For repeated scraping, store a hash of robots.txt and of the page content for each URL (see the sketch after this list)
- Add more metadata for each URL: `successful_crawl_count`, `failed_crawl_count`
- Add automated handling for database/table setup and connection
- The "INFO - scraper - Metadata saved for amazonaws.com" logging message may not be set up properly for the case where scraping fails
- Clean up `metadata.json` - this file shouldn't be created anymore
- Investigate whether all URL HTML data is actually getting downloaded
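A possible shape for the robots.txt/page-content hashing idea above; `content_hash` and `has_changed` are illustrative names, and where the stored hash lives (database column, cache, etc.) is left open.

```python
import hashlib


def content_hash(body: bytes) -> str:
    # Stable fingerprint of a robots.txt or page body.
    return hashlib.sha256(body).hexdigest()


def has_changed(new_body: bytes, stored_hash: str | None) -> bool:
    # Re-process a URL only when there is no stored hash or the content differs.
    return stored_hash is None or content_hash(new_body) != stored_hash
```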
Scraping/Playwright optimizations:

- Block unnecessary resources to speed up page loads (see the sketch below)
- Batch inserts for metadata?
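One way to block heavy resources with Playwright's async API is to intercept requests by resource type; whether and how this fits into the scraper's existing page-handling code is an assumption.

```python
from playwright.async_api import async_playwright

# Resource types that usually aren't needed when only the HTML matters.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


async def fetch_html(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        async def block_heavy_resources(route):
            # Abort requests for blocked resource types, let everything else through.
            if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
                await route.abort()
            else:
                await route.continue_()

        await page.route("**/*", block_heavy_resources)
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html
```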
Kafka improvements:

- Batch-send URLs to Kafka, flushing on a time interval (see the sketch below)
- Implement multiple consumers / multiple partitions
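Producer-side batching can lean on the client's built-in batching knobs rather than hand-rolled buffering; this sketch assumes kafka-python and a topic named `urls`, neither of which is confirmed by the notes above.

```python
from kafka import KafkaProducer

# linger_ms waits up to 500 ms to fill a batch (time-based flush);
# batch_size caps each batch at 64 KB (size-based flush).
# Whichever limit is hit first triggers the send.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    linger_ms=500,
    batch_size=64 * 1024,
)


def enqueue_urls(urls: list[str]) -> None:
    for url in urls:
        producer.send("urls", value=url.encode("utf-8"))
    producer.flush()  # push out anything still sitting in a partial batch
```

Multiple consumers then follow from giving the topic more than one partition and running several consumer processes in the same consumer group.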