Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.
Notable Features:
- Written in Python: lightweight, modifiable, robust, & scriptable
- Graceful stopping; on-disk database resume
- PhantomJS & youtube-dl integration (experimental)
Wpull uses Python 3.
Once Python is installed, download Wpull from PyPI using pip:
pip3 install wpull
For detailed installation instructions and potential caveats, please see https://wpull.readthedocs.io/en/master/install.html.
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example \ --warc-file blogsite-billy \ --no-check-certificate \ --no-robots --user-agent "InconspiuousWebBrowser/1.0" \ --wait 0.5 --random-wait --waitretry 600 \ --page-requisites --recursive --level inf \ --span-hosts-allow linked-pages,page-requisites \ --escaped-fragment --strip-session-id \ --sitemaps \ --reject-regex "/login\.php" \ --tries 3 --retry-connrefused --retry-dns-error \ --timeout 60 --session-timeout 21600 \ --delete-after --database blogsite-billy.db \ --quiet --output-file blogsite-billy.log
To see all options:
wpull --help
Documentation is located at https://wpull.readthedocs.io/. Please have a look at it before using Wpull's advanced features.
Need help? Please see our Help page which contains frequently asked questions and support information.
The issue tracker is located at https://github.com/chfoo/wpull/issues.
Contributions and feedback are greatly appreciated.
Copyright 2013-2016 by Christopher Foo and others. License GPL v3.
This project contains third-party source code licensed under different terms:
- wpull.backport.logging
- wpull.thirdparty.robotexclusionrulesparser
- wpull.thirdparty.dammit
We would like to acknowledge the authors of GNU Wget as Wpull uses algorithms from Wget.