WARC (Web ARChive) file tools for python 2/3 based on the WARC 1.0 spec and compatible with the Internet Archive's ARC File Format originally developed by Hanzo Archives.
pip install warctools
from hanzo import warctools
Write a WARC file:
import os
from hanzo import warctools
def write():
headers = [
(b'WARC-Type', b'warcinfo'),
(b'WARC-Date', b'2019-11-19T23:08:51.182451Z'),
(b'WARC-Filename', b'CRAWL-20191119230851-00000-hostname.warc.gz'),
(b'WARC-Record-ID', b'<urn:uuid:8cc5dcae-0b21-11ea-842b-525476278032>')
]
content_type = b'application/warc-fields'
content = 'This\nis\nonly\na\ntest\n'.encode()
fname = 'test.warc.gz'
mode = 'ab'
if not os.path.exists(fname):
mode = 'wb'
with open(fname, mode) as _fh:
content = (content_type, content)
record = warctools.WarcRecord(headers=headers, content=content)
record.write_to(_fh, gzip="record")
Returns 0 if the arguments are all valid W/ARC files, non-zero on error.
[warctools] $ warcvalid -h
Usage: warcvalid [options] warc warc warc
Options:
-h, --help show this help message and exit
-l LIMIT, --limit=LIMIT
-I INPUT_FORMAT, --input=INPUT_FORMAT
-L LOG_LEVEL, --log-level=LOG_LEVEL
Writes human readable summary of warcfiles. Autodetects input format when filenames are passed, i.e recordgzip vs plaintext, WARC vs ARC. Assumes uncompressed warc on stdin if no args.
[warctools] $ warcdump -h
Usage: warcdump [options] warc warc warc
Options:
-h, --help show this help message and exit
-l LIMIT, --limit=LIMIT
-I INPUT_FORMAT, --input=INPUT_FORMAT
-L LOG_LEVEL, --log-level=LOG_LEVEL
Searches all headers for regex pattern. Autodetects and stdin like warcdump. Prints out a WARC format by default. Use -i to invert search. Use -U to constrain to url. Use -T to constrain to record type. Use -C to constrain to content-type.
$ warcfilter -h
Usage: warcfilter [options] pattern warc warc warc
Options:
-h, --help show this help message and exit
-l LIMIT, --limit=LIMIT
limit (ignored)
-I INPUT_FORMAT, --input=INPUT_FORMAT
input format (ignored)
-i, --invert invert match
-U, --url match on url
-T, --type match on (warc) record type
-C, --content-type match on (warc) record content type
-H, --http-content-type
match on http payload content type
-D, --warc-date match on WARC-Date header
-L LOG_LEVEL, --log-level=LOG_LEVEL
log level(ignored)
Autodetects compression on file args. Assumes uncompressed stdin if none. Use -Z to write compressed output, i.e warc2warc -Z input > input.gz. Should ignore buggy records in input.
[warctools] $ warc2warc -h
Usage: warc2warc [options] url (url ...)
Options:
-h, --help show this help message and exit
-o OUTPUT, --output=OUTPUT
output warc file
-l LIMIT, --limit=LIMIT
-I INPUT_FORMAT, --input=INPUT_FORMAT
(ignored)
-Z, --gzip compress output, record by record
-D, --decode_http decode http messages (strip chunks, gzip)
-L LOG_LEVEL, --log-level=LOG_LEVEL
--wget-chunk-fix skip transfer-encoding headers in http records, when
decoding them (-D)
Creates a crappy WARC file from arc files on input. A handful of headers are preserved. Use -Z to write compressed output, i.e arc2warc -Z input.arc > input.warc.gz
[warctools] $ arc2warc -h
Usage: arc2warc [options] arc (arc ...)
Options:
-h, --help show this help message and exit
-o OUTPUT, --output=OUTPUT
output warc file
-l LIMIT, --limit=LIMIT
-Z, --gzip compress
-L LOG_LEVEL, --log-level=LOG_LEVEL
--description=DESCRIPTION
--operator=OPERATOR
--publisher=PUBLISHER
--audience=AUDIENCE
--resource=RESOURCE
--response=RESPONSE
DEPRECATED, use CDX-writer
branch.
#WARC-filename offset warc-type warc-subject-uri warc-record-id content-type content-length
warccrap/mywarc.warc 1196018 request /images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd1255a8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=request 193
warccrap/mywarc.warc 1196631 response http://www.hanzoarchives.com/images/slides/hanzo_markm__wwwoh.pdf <urn:uuid:fd2614f8-d07c-11df-b125-12313b0a18c6> application/http;msgtype=response 3279474
- arc2warc uses the conversion rules from the earlier arc2warc.c as a starter for converting the headers
- I haven't profiled the code yet (and don't plan to until it falls over)
- Warcvalid barely skirts some of the iso standard, missing things:
- strict whitespace
- required headers check
- mime quoted printable header encoding
- treating headers as utf8
- Lots more testing
- Support pre-1.0 WARC files
- Add more documentation
- Support more commandline options for output and filenames
- S3 urls
Originally developed by "tef" thomas.figg@hanzoarchives.com
.
@internetarchive