


david-smejkal/wiki2txt


wiki2txt

A tool to extract plain (unformatted) multilingual text, redirects, links and categories from Wikipedia backups (dumps). Designed to prepare clean training data for AI / machine learning software.

Written in Python. It uses the memory-efficient lxml SAX parser and relies heavily on the regex library.


Wiki XML dumps:

Installation

Supported Python versions: 3.4+
NOTICE: The older version 0.5.0 of this script works with Python 2.7+

[ Optional: Install and configure virtualenvwrapper to create your virtual environment with mkproject ]

$ mkproject wiki2txt

Alternatively, if you don't have mkproject, create your virtual environment manually:

$ virtualenv --python=`which python3` wiki2txt
$ source ./wiki2txt/bin/activate
(wiki2txt) $
(wiki2txt) $ git clone https://github.com/david-smejkal/wiki2txt.git .
(wiki2txt) $ pip install -r requirements.txt

Usage

(wiki2txt) $ python wiki2txt.py --help
Usage: wiki2txt.py [options]

Options:
  --version                    show program's version number and exit
  -h, --help                   show this help message and exit
  -i FILE, --input-file=FILE   take xml input from FILE otherwise from STDIN
  -o FILE, --output-file=FILE  output parsed articles to FILE otherwise to STDOUT
  -n, --no-text                don't parse text (designed for use with -r -l -c options)
  -t, --text                   produce plain (unformatted) text (DEFAULT)
  -s NUMBER, --skip=NUMBER     skip (resume after) NUMBER of articles (append to -o FILE)
  -q, --quiet                  stop making noise
  -R, --references             retain references in text (links and categories)
  -r FILE, --redirects=FILE    outsource redirect articles to the FILE
  -l FILE, --links=FILE        capture articles' links in the FILE
  -c FILE, --categories=FILE   capture articles' categories in the FILE
  -T, --test                   test by parsing directly from STDIN (bypasses lxml parser)
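
For batch jobs it can be convenient to assemble the command line from Python. The following is a minimal sketch, not part of wiki2txt itself: the helper name `build_wiki2txt_command` and its parameters are hypothetical, and it only uses flags listed in the help output above. It assumes `wiki2txt.py` is in the current directory.

```python
import sys


def build_wiki2txt_command(input_file, output_file, redirects=None,
                           links=None, categories=None, skip=0,
                           quiet=False, no_text=False):
    """Return an argv list for wiki2txt.py (hypothetical helper)."""
    cmd = [sys.executable, "wiki2txt.py", "-i", input_file, "-o", output_file]
    if redirects:
        cmd += ["-r", redirects]       # outsource redirect articles to a file
    if links:
        cmd += ["-l", links]           # capture article links in a file
    if categories:
        cmd += ["-c", categories]      # capture article categories in a file
    if skip:
        cmd += ["-s", str(skip)]       # resume after this many articles
    if quiet:
        cmd.append("-q")
    if no_text:
        cmd.append("-n")               # meant for use with -r / -l / -c
    return cmd


cmd = build_wiki2txt_command("articles1.xml", "parsed.xml",
                             redirects="redirects.edg", quiet=True)
print(" ".join(cmd[1:]))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the arguments as a list avoids shell quoting problems when file names contain spaces.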

Output Format

<article>
  <id>12</id>
  <title>Anarchism</title>
  <text>Anarchism is a political philosophy ...</text>
</article>
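
Reading the parsed output back can be sketched as follows. This is an illustration under assumptions, not the tool's own reader: it assumes every record is laid out as shown above, with `<article>` and `</article>` each on their own line, and that the text is XML-escaped. The function name `iter_articles` is hypothetical. It processes one record at a time, so memory use stays small even for large files.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(lines):
    """Yield (id, title, text) tuples from an iterable of output lines."""
    buffer = []
    for line in lines:
        if "<article>" in line:
            buffer = []                 # start collecting a new record
        buffer.append(line)
        if "</article>" in line:
            node = ET.fromstring("".join(buffer))
            yield (node.findtext("id"),
                   node.findtext("title"),
                   node.findtext("text"))
            buffer = []


# Sample taken from the output format shown above.
sample = """<article>
  <id>12</id>
  <title>Anarchism</title>
  <text>Anarchism is a political philosophy ...</text>
</article>
"""

for article_id, title, text in iter_articles(io.StringIO(sample)):
    print(article_id, title, len(text))
```

For a real file, pass an open file object instead of the `io.StringIO` sample, e.g. `open("parsed.xml", encoding="utf-8")`.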

Performance

Tested on a single core of an Intel i7 1.8 GHz processor

Python v3.11 (lxml v4.9.2) - Wikidump data processing speed of 9.7 MB/s
Python v3.10 (lxml v4.9.2) - Wikidump data processing speed of 9.2 MB/s
Python v3.9 (lxml v4.6.4) - Wikidump data processing speed of 7.6 MB/s
Python v2.7 (lxml v4.6.4) - Wikidump data processing speed of 5.2 MB/s
NOTICE: Parsing speed usually improves with newer versions of Python and the lxml library,
e.g. parsing with Python v3.11 is about 86% faster than with v2.7 (9.7 MB/s vs 5.2 MB/s).

Based on the above, at 9.7 MB/s it should take about 2 hours to process the latest English Wikipedia dump (72 GB of decompressed data).
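
The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch: the function names are hypothetical, and it assumes 1 GB = 1024 MB and a constant processing speed for the whole dump.

```python
def estimate_hours(size_gb, speed_mb_per_s, mb_per_gb=1024):
    """Estimated processing time in hours for a dump of size_gb."""
    seconds = size_gb * mb_per_gb / speed_mb_per_s
    return seconds / 3600


def speedup_percent(new_speed, old_speed):
    """How much faster new_speed is than old_speed, in percent."""
    return (new_speed / old_speed - 1) * 100


# Figures from the table above: 72 GB dump, 9.7 MB/s (v3.11), 5.2 MB/s (v2.7).
print(f"{estimate_hours(72, 9.7):.1f} hours")
print(f"{speedup_percent(9.7, 5.2):.1f} % faster")
```

Swap in your own measured speed to estimate run time on different hardware.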

Examples

Download => Decompress => Parse

(wiki2txt) $ wget -O articles1.xml.bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2 # 254 MB
(wiki2txt) $ bzip2 --decompress articles1.xml.bz2 # 940 MB
(wiki2txt) $ python wiki2txt.py -i articles1.xml -o parsed.xml -r redirects.edg # 400 MB

Download latest complete wikidump

$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 # 19 GB

HINT: add the --continue parameter to wget if you need to resume the download

Decompress

$ bzip2 --decompress enwiki-latest-pages-articles.xml.bz2 # 72 GB

HINT: add the -k parameter to bzip2 if you want to preserve the original archive

Parse

(wiki2txt) $ python wiki2txt.py -i enwiki-latest-pages-articles.xml -o clean-data.xml

Piping input

(wiki2txt) $ cat enwiki-latest-pages-articles.xml | python wiki2txt.py > clean-data.xml

HINT: redirecting output to a file like this yields slightly faster parsing (9.9 MB/s)
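
Streaming decompression

Because wiki2txt reads from STDIN when no -i option is given, the decompress and parse steps can in principle be combined so that the 72 GB decompressed file never has to be written to disk. The following is a minimal sketch using only the Python standard library. The helper name `stream_bz2_to_command` is hypothetical, and this approach has not been benchmarked against the figures above.

```python
import bz2
import shutil
import subprocess
import sys


def stream_bz2_to_command(archive_path, command, output_path):
    """Decompress archive_path on the fly and feed it to command's STDIN.

    The command's STDOUT is written to output_path.
    Returns the command's exit code.
    """
    with bz2.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=dst)
        try:
            # Copy in 1 MiB chunks so memory use stays constant.
            shutil.copyfileobj(src, proc.stdin, length=1024 * 1024)
        finally:
            proc.stdin.close()          # signal end of input to the parser
        return proc.wait()


# Example call (assumes wiki2txt.py is in the current directory):
# stream_bz2_to_command("articles1.xml.bz2",
#                       [sys.executable, "wiki2txt.py"],
#                       "parsed.xml")
```

This trades disk space for CPU time, since decompression then runs alongside parsing.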