Python Website Crawler/Analyzer

This project is the running example built up over the course of my book, Learning Concurrency with Python.

PyCrawler is a multithreaded Python web crawler that uses BeautifulSoup and the urllib.request module to crawl every page of a given website.

Example

> python3.6 main.py
> Website > https://tutorialedge.net
# Performs crawl
# Outputs to results.csv

This crawler was built as a teaching resource for my upcoming book, Learning Concurrency with Python.

Features

Crawls all links within a given web domain and checks the response status code of each.
Performs all crawls concurrently, using Python's concurrency tools.
Writes the result of each page crawl to a results.csv file.
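
The features above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names fetch_status, crawl, and write_results, the worker count, and the CSV column names are all hypothetical. It uses only the standard library (concurrent.futures, urllib.request, csv) and checks a fixed list of URLs rather than discovering links with BeautifulSoup.

```python
import csv
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def crawl(urls, fetch=fetch_status, max_workers=8):
    """Check every URL on a thread pool; return (url, status) pairs in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch concurrently but yields results in input order.
        statuses = list(pool.map(fetch, urls))
    return list(zip(urls, statuses))


def write_results(rows, out_file):
    """Write (url, status) rows to an already-open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "status"])  # assumed column names
    writer.writerows(rows)
```

To produce a results.csv with this sketch, open the file with open("results.csv", "w", newline="") and pass it to write_results together with the output of crawl. Threads suit this workload because each worker spends most of its time waiting on the network.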

Todo

Analyse each page to check whether it complies with the on-page SEO checklist.

On-Page Checklist

Must contain a single <title> tag within the <head>
Must contain <meta name="description" content="...">
Images should contain an alt="keyword" attribute
Report back 5 most popular keywords on all pages
Report back the size of each page and split this into buckets of js, css, images, other...
Report back the number of requests each page takes to load and categorize them into buckets
Perform load time analysis of each page
Report back if JavaScript or CSS is not minified
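
As a rough sketch of how the first four checklist items might be tested against a page's HTML, the following uses the standard library's html.parser rather than BeautifulSoup, purely so the example has no third-party dependencies. The class name OnPageAudit, the function audit, the result keys, and the "three or more letters" definition of a keyword are all assumptions made for illustration. The page-size, request-count, load-time, and minification items are not covered.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class OnPageAudit(HTMLParser):
    """Collects the facts needed for a few of the checklist items."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.skip_depth = 0  # >0 while inside <script>/<style>
        self.titles_in_head = 0
        self.has_meta_description = False
        self.images_missing_alt = 0
        self.words = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.titles_in_head += 1
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name == "description" and attrs.get("content"):
                self.has_meta_description = True
        elif tag == "img" and not attrs.get("alt"):
            # Missing or empty alt attribute.
            self.images_missing_alt += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Count visible text only; script and style bodies are skipped.
        if not self.skip_depth:
            # Assumed keyword definition: runs of 3+ ASCII letters.
            self.words.update(re.findall(r"[a-z]{3,}", data.lower()))


def audit(html):
    """Return a dict of checklist findings for one HTML document."""
    parser = OnPageAudit()
    parser.feed(html)
    parser.close()
    return {
        "single_title": parser.titles_in_head == 1,
        "meta_description": parser.has_meta_description,
        "images_missing_alt": parser.images_missing_alt,
        "top_keywords": parser.words.most_common(5),
    }
```

This sketch counts keywords per page; a site-wide top five would need the per-page counters summed across the crawl. A real keyword report would probably also filter out common stop words.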

