This is a python module to collect urls over static files (code and documentation)
and then test for and report broken links. If you are interesting in using
this as a GitHub action, see urlchecker-action. There are also container
bases available on quay.io/urlstechie/urlchecker. As of version
0.0.26, we use multiprocessing so the checks run a lot faster, and you can set URLCHECKER_WORKERS
to change the number of workers
(defaults to 9). If you don't want multiprocessing, use version 0.0.25 or earlier.
A detailed documentation of the code is available under urlchecker-python.readthedocs.io
You can install the urlchecker from pypi. Before you do, it's recommended to install fake-useragent from:
pip install git+https://github.com/danger89/fake-useragent.git
And then urlchecker:
$ pip install urlchecker
or install from the repository directly:
$ git clone https://github.com/urlstechie/urlchecker-python.git
$ cd urlchecker-python
$ python setup.py install
Installation will place a binary, urlchecker
in your Python path.
$ which urlchecker
/home/vanessa/anaconda3/bin/urlchecker
Your most likely use case will be to check a local directory with static files (documentation or code) for files. In this case, you can use urlchecker check:
$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
[--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
[--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
[--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
path
positional arguments:
path the local path or GitHub repository to clone and check
options:
-h, --help show this help message and exit
-b BRANCH, --branch BRANCH
if cloning, specify a branch to use (defaults to main)
--subfolder SUBFOLDER
relative subfolder path within path (if not specified, we use root)
--cleanup remove root folder after checking (defaults to False, no cleaup)
--serial run checks in serial (no multiprocess)
--no-check-certs Allow urls to validate that fail certificate checks
--force-pass force successful pass (return code 0) regardless of result
--no-print Skip printing results to the screen (defaults to printing to console).
--verbose Print file names for failed urls in addition to the urls.
--file-types FILE_TYPES
comma separated list of file extensions to check (defaults to .md,.py)
--files FILES comma separated list of exact files or patterns to check.
--exclude-urls EXCLUDE_URLS
comma separated links to exclude (no spaces)
--exclude-patterns EXCLUDE_PATTERNS
comma separated list of patterns to exclude (no spaces)
--exclude-files EXCLUDE_FILES
comma separated list of files and patterns to exclude (no spaces)
--save SAVE Path to a csv file to save results to.
--retry-count RETRY_COUNT
retry count upon failure (defaults to 2, one retry).
--timeout TIMEOUT timeout (seconds) to provide to the requests library (defaults to 5)
You have a lot of flexibility to define patterns of urls or files to skip, along with the number of retries or timeout (seconds). The most basic usage will check an entire directory. Let's clone and check the urlchecker action:
$ git clone https://github.com/urlstechie/urlchecker-action.git
$ cd urchecker-action
and run the simplest command to check the present working directory (.).
$ urlchecker check .
original path: .
final path: /tmp/urlchecker-action
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py']
files: []
print all: True
urls excluded: []
url patterns excluded: []
file patterns excluded: []
force pass: False
retry count: 2
save: None
timeout: 5
/tmp/urlchecker-action/README.md
--------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/rseng/awesome-rseng
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/USRSE/usrse.github.io
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/urlstechie/urlchecker-action/releases
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/buildtesters/buildtest
https://github.com/r-hub/docs
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/urlstechie/URLs-checker-test-repo
https://github.com/marketplace/actions/urlchecker-action
https://github.com/actions/checkout
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/SuperKogito/Voice-based-gender-recognition/issues
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
/tmp/urlchecker-action/examples/README.md
-----------------------------------------
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/urlstechie/urlchecker-action/issues
https://help.github.com/en/actions/reference/events-that-trigger-workflows
Done. The following urls did not pass:
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
The url that didn't pass above is an example parameter for the library! Let's add a simple pattern to exclude it.
$ urlchecker check --exclude-pattern SuperKogito .
original path: .
final path: /tmp/urlchecker-action
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py']
files: []
print all: True
urls excluded: []
url patterns excluded: ['SuperKogito']
file patterns excluded: []
force pass: False
retry count: 2
save: None
timeout: 5
/tmp/urlchecker-action/README.md
--------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/actions/checkout
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
https://github.com/rseng/awesome-rseng
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/urlstechie/URLs-checker-test-repo
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/r-hub/docs
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/buildtesters/buildtest
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/marketplace/actions/urlchecker-action
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
/tmp/urlchecker-action/examples/README.md
-----------------------------------------
https://help.github.com/en/actions/reference/events-that-trigger-workflows
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/urlstechie/urlchecker-action/releases
Done. All URLS passed.
We can also filter by file types. If we want to do this (for example, to only check different file types) we might do any of the following:
# Check only html files
urlchecker check --file-types *.html .
# Check hidden flies
urlchecker check --file-types ".*" .
# Check hidden files and html files
urlchecker check --file-types ".*,*.html" .
Note that while some patterns will work without quotes, it's recommended for most
to use them because if the shell expands any part of the pattern, it will not work as
expected. By default, the urlchecker checks python and markdown. If a multiprocessing workers has an error,
you can also add --serial
to run in serial and test. The run will be slower, but it's useful for debugging.
$ urlchecker check . --files "content/docs/hacking/contributing/documentation/index.md" --serial
But wouldn't it be easier to not have to clone the repository first?
Of course! We can specify a GitHub url instead, and add --cleanup
if we want to clean up the folder after.
$ urlchecker check https://github.com/SuperKogito/SuperKogito.github.io.git
If you specify any arguments for a white list (or any kind of expected list) make sure that you provide a comma separated list without any spaces
$ urlchecker check --exclude-files=README.md,_config.yml
If you want to save your results to file, perhaps for some kind of record or
other data analysis, you can provide the --save
argument:
$ urlchecker check --save results.csv .
The file that you save to will include a comma separated value tabular listing
of the urls, and their result. The result options are "passed" and "failed"
and the default header is URL,RESULT
. All of these defaults are exposed
if you want to change them (e.g., using a tab separator or a different header)
if you call the function from within Python. Here is an example of the default file
produced, which should satisfy most use cases:
URL,RESULT
https://github.com/SuperKogito,passed
https://www.google.com/,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues,passed
https://github.com/SuperKogito/Voice-based-gender-recognition,passed
https://github.com/SuperKogito/spafe/issues/4,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/2,passed
https://github.com/SuperKogito/spafe/issues/5,passed
https://github.com/SuperKogito/URLs-checker/blob/master/README.md,passed
https://img.shields.io/,passed
https://github.com/SuperKogito/spafe/,passed
https://github.com/SuperKogito/spafe/issues/3,passed
https://www.google.com/,passed
https://github.com/SuperKogito,passed
https://github.com/SuperKogito/spafe/issues/8,passed
https://github.com/SuperKogito/spafe/issues/7,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/1,passed
https://github.com/SuperKogito/spafe/issues,passed
https://github.com/SuperKogito/URLs-checker/issues,passed
https://github.com/SuperKogito/spafe/issues/2,passed
https://github.com/SuperKogito/URLs-checker,passed
https://github.com/SuperKogito/spafe/issues/6,passed
https://github.com/SuperKogito/spafe/issues/1,passed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/3,failed
https://none.html,failed
https://github.com/SuperKogito/URLs-checker/issues/2,failed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/1,failed
https://github.com/SuperKogito/URLs-checker/issues/4,failed
If you want to check a list of urls outside of the provided client, this is fairly easy to do! Let's say we have a path, our present working directory, and we want to check .py and .md files (the default)
from urlchecker.core.check import UrlChecker
import os
path = os.getcwd()
checker = UrlChecker(path)
# UrlChecker:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python
And of course you can provide more substantial arguments to derive the original file list:
checker = UrlChecker(
path=path,
file_types=[".md", ".py", ".rst"],
include_patterns=[],
exclude_files=["README.md", "LICENSE"],
print_all=True,
)
I can then run the checker like this:
checker.run()
Or with more customization of excluded urls:
checker.run(
exclude_urls=exclude_urls,
exclude_patterns=exclude_patterns,
retry_count=3,
timeout=5,
)
You'll get the results object returned, which is also available at checker.results
,
a simple dictionary with "passed" and "failed" keys to show passes and fails across
all files.
{'passed': ['https://github.com/SuperKogito/spafe/issues/4',
'http://shachi.org/resources',
'https://superkogito.github.io/blog/SpectralLeakageWindowing.html',
'https://superkogito.github.io/figures/fig4.html',
'https://github.com/urlstechie/urlchecker-test-repo',
'https://www.google.com/',
...
'https://github.com/SuperKogito',
'https://img.shields.io/',
'https://www.google.com/',
'https://docs.python.org/2'],
'failed': ['https://github.com/urlstechie/urlschecker-python/tree/master',
'https://github.com/SuperKogito/Voice-based-gender-recognition,passed',
'https://github.com/SuperKogito/URLs-checker/README.md',
...
'https://superkogito.github.io/tables',
'https://github.com/SuperKogito/URLs-checker/issues/2',
'https://github.com/SuperKogito/URLs-checker/README.md',
'https://github.com/SuperKogito/URLs-checker/issues/4',
'https://github.com/SuperKogito/URLs-checker/issues/3',
'https://github.com/SuperKogito/URLs-checker/issues/1',
'https://none.html']}
You can look at checker.checks
, which is a dictionary of result objects,
organized by the filename:
for file_name, result in checker.checks.items():
print()
print(result)
print("Total Results: %s " % result.count)
print("Total Failed: %s" % len(result.failed))
print("Total Passed: %s" % len(result.passed))
...
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/tests/test_files/sample_test_file.md
Total Results: 26
Total Failed: 6
Total Passed: 20
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.pytest_cache/README.md
Total Results: 1
Total Failed: 0
Total Passed: 1
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.eggs/pytest_runner-5.2-py3.7.egg/ptr.py
Total Results: 0
Total Failed: 0
Total Passed: 0
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/docs/source/conf.py
Total Results: 3
Total Failed: 0
Total Passed: 3
For any result object, you can print the list of passed, falied, white listed, or all the urls.
result.all
['https://www.sphinx-doc.org/en/master/usage/configuration.html',
'https://docs.python.org/3',
'https://docs.python.org/2']
result.failed
[]
result.exclude
[]
result.passed
['https://www.sphinx-doc.org/en/master/usage/configuration.html',
'https://docs.python.org/3',
'https://docs.python.org/2']
result.count
3
If you start with a list of urls you want to check, you can do that too!
from urlchecker.core.urlproc import UrlCheckResult
urls = ['https://www.github.com', "https://github.com", "https://banana-pudding-doesnt-exist.com"]
# Instantiate an empty checker to extract urls
checker = UrlCheckResult()
File name None is undefined or does not exist, skipping extraction.
If you provied a file name, the urls would be extracted for you.
checker = UrlCheckResult(
file_name=file_name,
exclude_patterns=exclude_patterns,
exclude_urls=exclude_urls,
print_all=self.print_all,
)
or you can provide all the parameters without the filename:
checker = UrlCheckResult(
exclude_patterns=exclude_patterns,
exclude_urls=exclude_urls,
print_all=self.print_all,
)
If you don't provide the file_name to check urls, you can give the urls
you defined previously directly to the check_urls
function:
checker.check_urls(urls)
https://www.github.com
https://github.com
HTTPSConnectionPool(host='banana-pudding-doesnt-exist.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f989abdfa10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
https://banana-pudding-doesnt-exist.com
And of course you can specify a timeout and retry:
checker.check_urls(urls, retry_count=retry_count, timeout=timeout)
After you run the checker you can get all the urls, the passed, and failed sets:
checker.failed
['https://banana-pudding-doesnt-exist.com']
checker.passed
['https://www.github.com', 'https://github.com']
checker.all
['https://www.github.com',
'https://github.com',
'https://banana-pudding-doesnt-exist.com']
checker.all
['https://www.github.com',
'https://github.com',
'https://banana-pudding-doesnt-exist.com']
checker.count
3
If you have any questions, please don't hesitate to open an issue.
A Docker container is provided if you want to build a base container with urlchecker, meaning that you don't need to install it on your host. You can build the container as follows:
docker build -t urlchecker .
And then the entrypoint will expose the urlchecker.
docker run -it urlschecker
The module is organized as follows:
├── client # command line client
├── main # functions for supported integrations (e.g., GitHub)
├── core # core file and url processing tools
└── version.py # package and versioning
In the "client" folder, for example, the commands that are exposed for the client
(e.g., check) would named accordingly, e.g., client/check.py
.
Functions for Github are be provided in main/github.py
. This organization should
be fairly straight forward to always find what you are looking for.
To test more difficult urls, we use a web driver, and you can choose between:
- Chrome Driver
- Gecko Driver (firefox)
both to be used with selenium. This driver is optional, but will come by default with our action. To install it, you can download the driver at either of the links above and ensure you install selenium:
$ pip install urlchecker[selenium]
and either:
- Add it directly to your path
- Export the directory where it lives as
URLCHECKER_DRIVERS_PATH
- Put it in the root of the urlchecker clone (it will be looked for here)
If you need help, or want to suggest a project for the organization, please open an issue