This repository has been archived by the owner on Jan 12, 2025. It is now read-only.

Description of purpose
pmyteh committed Sep 25, 2014
1 parent e09b694 commit 1f0e290
Showing 1 changed file with 12 additions and 3 deletions.
15 changes: 12 additions & 3 deletions Readme.rst
@@ -1,6 +1,13 @@
-warctika: Python library for reading and writing warc files, and processing
-their contents through Apache Tika
-============================================================================
+warctika: Python library for processing WARC files through Apache Tika
+======================================================================
 
+This library is designed to handle web crawl data fetched using the
+Heritrix web crawler (or other tools producing WARC files), extract
+the plain text from structured formats and resave the data as WARC
+"conversion" records.
+
+The primary use for this tool is to extract text from webcrawl data
+sets for use in machine learning and supervised classification work.
+
 This library was originally based upon the "warc" library by the Internet
 Archive and others, but now relies upon the hanzo warctools. These can be
@@ -15,5 +22,7 @@ License
 -------
 
 This software is licensed under GPL v2 or later. See LICENSE_ file for details.
+The contents of the warcresponseparse.py file are derived directly from
+Hanzo warctools code and can be used under the terms of the MIT license.
 
 .. LICENSE: http://github.com/pmyteh/warctika/blob/master/LICENSE
