crau is the way (most) Brazilians pronounce crawl, it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs.
pip install crau
Archive a list of URLs by passing them via command-line:
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
or passing a text file (one URL per line):
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txt
crau archive myarchive.warc.gz -i urls.txt
Run crau archive --help
for more options.
List archived URLs in a WARC file:
crau list myarchive.warc.gz
Extract a file from an archive:
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
Run a server on localhost:8080 to play your archive:
crau play myarchive.warc.gz
If you've mirrored a website using wget -r
, httrack
or a similiar tool in
which you have the files in your file system, you can use crau
to create a
WARC file based on this. Run:
crau pack [--inner-directory=path] <start-url> <path-or-archive> <warc-filename>
Where:
start-url
: base URL you've downloaded (this will be joined with the actual file names to create the complete URL).path-or-archive
: path where the files are located. Can also be a.tar.gz
,.tar.bz2
,.tar.xz
or.zip
archive.crau
will retrieve all files recursively.warc-filename
: file to be created.--inner-directory
: used when a TAR/ZIP archive is passed to filter which directory inside the archive will be used to retrieve files. Example: you have an archive with abackup/
directory on the root and awww.example.com/
inside of it, so the files are actually insidebackup/www.example.com/
- just pass--inner-directory=backup/www.example.com/
and only the files inside this path will be considered (in this example, the filebackup/www.example.com/contact.html
will be archived as<start-url>/contact.html
).
There are other archiving tools, of course. The motivation to start this project was a lack of easy, fast and robust software to archive URLs - I just wanted to execute one command without thinking and get a WARC file. Depending on your problem, crau may not be the best answer - check out more archiving tools in awesome-web-archiving.
Why not GNU Wget?
- Lacks parallel downloading;
- Some versions just crashes with segmentation fault depending on the website;
- Lots of options make the task of archiving difficult;
- There's no easy way to extend its behavior.
Why not Wpull?
- Lots of options make the task of archiving difficult;
- Easiest to extend than wget, but still difficult comparing to crau (since crau uses scrapy).
Why not crawl?
- Lacks some features and it's difficult to contribute to (the Gitlab instance where it's hosted doesn't allow registration);
- Has some bugs regarding to collecting page dependencies (like static assets inside a CSS file);
- Has a bug where it enters in a loop (if a static asset returns a HTML page instead of the expected file it ignores depth and keep trying to get this page's dependencies - if any of the latter dependencies also has the same problem it keeps going on infinite depth).
Why not archivenow?
This tool can be used easily to use archiving services such as archive.is via command-line and can also, but when archiving it calls wget to do the job.
Clone the repository:
git clone https://github.com/turicas/crau.git
Install development dependencies (you may want to create a virtualenv):
cd crau && pip install -r requirements-development.txt
Install an editable version of the package:
pip install -e .
Modify everything you want to, commit to another branch and then create a pull request at GitHub.