Finds broken links in .md files in a GitHub repository.
Checks all links found on a site or in the .md files of a GitHub repository. The default starting URL is hardcoded as https://github.com/codedokode/pasta/blob/master/README.md, but it can be changed using CLI arguments.
The script visits all pages on the site, finds all links within them, and checks the response status of each link. The list of broken links is printed to the console.
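The link-finding step can be pictured with a rough shell sketch. This is not how checker.php works internally; the function name extract_links is hypothetical, and a regular expression is a crude stand-in for a real HTML parser (it only sees double-quoted href attributes).

```shell
# Hypothetical helper: print every double-quoted href value found on stdin,
# one per line. Attribute names are matched case-insensitively.
extract_links() {
    grep -oi 'href="[^"]*"' | sed 's/^[^"]*"//; s/"$//'
}
```

For example, curl -s http://example.com/ | extract_links would list the raw link targets of that page, which a checker would then resolve against the page URL and request one by one.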
The URL checker pauses between requests and caches responses on the filesystem.
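To illustrate the idea of pausing and caching, here is a minimal shell sketch. Everything in it is an assumption made for illustration: the function names, the CACHE_DIR layout, the one-second DELAY_SECONDS default, and the choice to cache only the status code are not taken from checker.php.

```shell
# Assumed settings, not the checker's real configuration.
CACHE_DIR="${CACHE_DIR:-/tmp/link-checker-cache}"
DELAY_SECONDS="${DELAY_SECONDS:-1}"

# Map a URL to a cache file name by hashing it.
# cksum is a 32-bit CRC, so collisions are possible; good enough for a sketch.
cache_path() {
    hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    printf '%s/%s.status\n' "$CACHE_DIR" "$hash"
}

# Print the HTTP status code for a URL, reusing the cached value if present.
cached_status() {
    file=$(cache_path "$1")
    if [ -f "$file" ]; then
        cat "$file"
        return 0
    fi
    mkdir -p "$CACHE_DIR"
    sleep "$DELAY_SECONDS"   # pause before every real request
    status=$(curl -s -o /dev/null -w '%{http_code}' "$1")
    printf '%s\n' "$status" > "$file"
    printf '%s\n' "$status"
}
```

With this layout, a second call to cached_status for the same URL returns immediately from the cache file and sends no request.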
- git clone
- composer install
php checker.php -u http://example.com/
Type php checker.php --help for help.
Choose an unused port number and start a temporary web server:
php -S 127.0.0.1:10001 -t tests/public/
Then run tests using phpunit in a separate console:
export LINK_CHECKER_TEST_SERVER_PORT=10001
phpunit
Alternatively, use the run-tests.sh shell script.
- the script treats all non-HTML pages (PDF, images) as invalid
- the script cannot detect parked domains
- check fragments (page.html#something)
- use HEAD requests for leaf pages where possible
- don't cache and don't even load huge files
- be able to check local HTML files
- check image/css/js references
- pick URLs from queue so that we don't have to wait
- find and report redirects
- maybe use delay based on last 2 domain parts, not whole domain
- maybe obey robots.txt?
- links like https://mega.nz/#!12345 and https://rghost.net/12345 are not checked properly
- support some other 2xx codes like 203
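As an illustration of the item about basing the delay on the last 2 domain parts, the grouping key could be computed roughly like this. The function name delay_key is hypothetical, and this is only one possible reading of that item.

```shell
# Hypothetical helper: reduce a URL to the last two labels of its host name,
# so that www.example.com and cdn.example.com share one delay timer.
delay_key() {
    host=${1#*://}            # drop the scheme
    host=${host%%[/?#]*}      # drop path, query string, and fragment
    host=${host##*@}          # drop userinfo, if any
    host=${host%%:*}          # drop the port
    printf '%s\n' "$host" |
        awk -F. '{ if (NF >= 2) print $(NF-1) "." $NF; else print $0 }'
}
```

For example, delay_key https://www.example.com/page prints example.com. One caveat of this approach: under suffixes such as co.uk, unrelated sites would end up in the same group, and IP addresses would be truncated, which may be why the item is marked as a "maybe".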