Home
WikiExtractor.py is a script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires no additional libraries.
Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underlines, images, tables, etc.). It also allows HTML markup to be inserted in the documents. Wiki and HTML tags are sometimes misused (unclosed tags, wrong attributes, etc.), so the extractor applies some heuristics to work around such problems.
WikiExtractor.py can perform template expansion to some extent, but it does not fully support Lua modules.
The script is invoked with a Wikipedia dump file as an argument. Use the article dumps which are available as http://dumps.wikimedia.org/XXwiki/latest/XXwiki-latest-pages-articles.xml.bz2, where XX is the language identifier (e.g. en, es, zh).
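The URL pattern above can be filled in mechanically. The following is a minimal sketch; the helper name `dump_url` is hypothetical and not part of WikiExtractor.

```python
def dump_url(lang: str) -> str:
    """Return the latest article-dump URL for a language identifier such as 'en'."""
    wiki = f"{lang}wiki"  # e.g. 'enwiki', 'eswiki', 'zhwiki'
    return f"http://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"


# Print the dump URLs for the example languages mentioned in the text.
for code in ("en", "es", "zh"):
    print(dump_url(code))
```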
The output is stored in a number of files of similar size in a chosen directory. Each file contains several documents in the extractor's output file format.
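The exact output format is not reproduced on this page. As a rough sketch, assuming each document is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` element (an assumption; the attribute set may differ between versions), an output file could be split into documents as follows. The helper `iter_docs` and the sample text are hypothetical.

```python
import re

# Assumption: each document looks like <doc id=".." url=".." title="..">text</doc>.
DOC_RE = re.compile(r'<doc\s+([^>]*)>\n?(.*?)</doc>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(text):
    """Yield (attributes, body) pairs for every <doc> element found in text."""
    for match in DOC_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        yield attrs, match.group(2).strip()


# Made-up sample standing in for the contents of one output file.
sample = '''<doc id="1" url="https://example.org/wiki?curid=1" title="First">
First

Body of the first article.
</doc>
<doc id="2" url="https://example.org/wiki?curid=2" title="Second">
Second

Body of the second article.
</doc>
'''

docs = list(iter_docs(sample))
for attrs, body in docs:
    print(attrs["id"], attrs["title"], len(body))
```

A regular expression is used here rather than an XML parser because a file holding several top-level `<doc>` elements is not a single well-formed XML document.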
Template expansion requires first preprocessing the whole dump to collect template definitions.
Usage:
python -m wikiextractor.WikiExtractor [options] xml-dump-file
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to stdout)
  --processes PROCESSES
                        Number of processes to use (default 23)
  -b n[KMG], --bytes n[KMG]
                        put specified bytes per output file (default is 1M)
  -c, --compress        compress output files using bzip
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -s, --sections        preserve sections
  -a, --article         analyze a file containing a single article
  --templates TEMPLATES
                        use or create file containing templates
  --no_templates        do not expand templates
  -r, --revision        Include the document revision id (default=False)
  --min_text_length MIN_TEXT_LENGTH
                        Minimum expanded text length required to write document (default=0)
  --filter_disambig_pages
                        Remove pages from output that contain disambiguation markup (default=False)
  --threads THREADS     Number of threads to use (default 8)
  -v, --version         print program version
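The `-b` option takes a size written as `n[KMG]`. To illustrate that notation, here is a minimal sketch of how such a value might be turned into a byte count. It is not the extractor's own parsing code: the helper `parse_size` is hypothetical, and the use of binary multiples (1K = 1024 bytes) is an assumption.

```python
import re

# Assumption: K, M and G are binary multiples; no suffix means plain bytes.
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(spec: str) -> int:
    """Convert a size such as '500K', '1M' or '2G' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", spec.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"invalid size: {spec!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit.upper()]


print(parse_size("1M"))
```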
Saving templates to a file speeds up subsequent extractions, provided the template definitions have not changed.
The --no-templates option significantly speeds up the extractor by avoiding the cost of expanding MediaWiki templates.
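As a sketch of how a template file might be reused across runs, the snippet below assembles a command line from the `-o`, `-b` and `--templates` options listed in the usage text. The helper `build_command` and the file names `templates.txt` and `extracted` are hypothetical; the command is only built and printed, not executed.

```python
import sys


def build_command(dump_path, output_dir, templates_file=None, bytes_per_file="1M"):
    """Assemble an argument list for running the extractor as a module."""
    cmd = [sys.executable, "-m", "wikiextractor.WikiExtractor",
           "-o", output_dir, "-b", bytes_per_file]
    if templates_file:
        # The same file is passed on every run so that template definitions
        # collected the first time can be reused later.
        cmd += ["--templates", templates_file]
    cmd.append(dump_path)  # the dump file is the positional argument
    return cmd


cmd = build_command("enwiki-latest-pages-articles.xml.bz2", "extracted",
                    templates_file="templates.txt")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```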
- All Wikipedia database dumps
- Torrents for use with a BitTorrent client such as uTorrent
- WikiPrep: a Perl tool for preprocessing Wikipedia XML dumps.
- Extracting Text from Wikipedia: another Python tool for extracting text from Wikipedia XML dumps.
- Alternative Parsers: a list of links, descriptions, and status reports for the various alternative MediaWiki parsers.