A tool to extract plain (unformatted) multilingual text, redirects, links and categories from Wikipedia backups.
Designed to prepare clean training data for AI / machine learning software.
Written in Python. It uses the lxml SAX (memory-efficient) parser and leans heavily on the regex library.
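To illustrate why an event-driven (SAX-style) parser stays memory efficient on multi-gigabyte dumps, here is a minimal sketch. It is not the code of wiki2txt itself: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml, and the iter_pages helper, the local_name helper and the sample XML (including its namespace URL) are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that namespaced XML puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page>, discarding each page once handled."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children and text


# Hypothetical two-page sample shaped like a MediaWiki export
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Anarchism</title>
    <revision><text>Anarchism is a political philosophy ...</text></revision>
  </page>
  <page>
    <title>AccessibleComputing</title>
    <revision><text>#REDIRECT [[Computer accessibility]]</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", text[:30])
```

Because each page is handled and cleared as soon as its closing tag arrives, the working set is roughly one article at a time rather than the whole dump.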
- https://dumps.wikimedia.org/enwiki/ (English)
- https://dumps.wikimedia.org/ruwiki/ (Russian)
- https://dumps.wikimedia.org/zhwiki/ (Chinese)
Supported Python versions: 3.4+
NOTICE: The older version 0.5.0 of this script works with Python 2.7+
[ Optional: Install and configure virtualenvwrapper to create your virtual environment with mkproject ]
$ mkproject wiki2txt
Alternatively, if you don't have mkproject, create your virtual environment manually:
$ virtualenv --python=`which python3` wiki2txt
$ source ./wiki2txt/bin/activate
(wiki2txt) $
(wiki2txt) $ git clone https://github.com/david-smejkal/wiki2txt.git .
(wiki2txt) $ pip install -r requirements.txt
(wiki2txt) $ python wiki2txt.py --help
Usage: wiki2txt.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i FILE, --input-file=FILE take xml input from FILE otherwise from STDIN
-o FILE, --output-file=FILE output parsed articles to FILE otherwise to STDOUT
-n, --no-text don't parse text (designed for use with -r -l -c options)
-t, --text produce plain (unformatted) text (DEFAULT)
-s NUMBER, --skip=NUMBER skip (resume after) NUMBER of articles (append to -o FILE)
-q, --quiet stop making noise
-R, --references retain references in text (links and categories)
-r FILE, --redirects=FILE outsource redirect articles to the FILE
-l FILE, --links=FILE capture articles' links in the FILE
-c FILE, --categories=FILE capture articles' categories in the FILE
-T, --test test by parsing directly from STDIN (bypasses lxml parser)
<article>
<id>12</id>
<title>Anarchism</title>
<text>Anarchism is a political philosophy ...</text>
</article>
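As a rough illustration of consuming this output downstream, the sketch below loads article records into Python dictionaries. It assumes the output is a run of well-formed article elements exactly like the sample above, with no XML declaration and with special characters escaped; neither assumption is confirmed here. The read_articles helper is hypothetical and is only suitable for small files, since it loads everything into memory.

```python
import xml.etree.ElementTree as ET


def read_articles(xml_fragment):
    """Return a list of dicts with the id, title and text of each <article>.

    Assumption: the fragment is a run of well-formed <article> elements,
    so a synthetic <root> is wrapped around them to get one parseable tree.
    """
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [
        {
            "id": int(article.findtext("id")),
            "title": article.findtext("title"),
            "text": article.findtext("text") or "",
        }
        for article in root.iter("article")
    ]


sample = """
<article>
<id>12</id>
<title>Anarchism</title>
<text>Anarchism is a political philosophy ...</text>
</article>
"""

articles = read_articles(sample)
print(articles[0]["id"], articles[0]["title"])
```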
Tested on a single core of a 1.8 GHz Intel i7 processor
Python v3.11 (lxml v4.9.2)
- Wikidump data processing speed of 9.7 MB/s
Python v3.10 (lxml v4.9.2)
- Wikidump data processing speed of 9.2 MB/s
Python v3.9 (lxml v4.6.4)
- Wikidump data processing speed of 7.6 MB/s
Python v2.7 (lxml v4.6.4)
- Wikidump data processing speed of 5.2 MB/s
NOTICE: Parsing speed usually improves with newer versions of Python and the lxml library, e.g. parsing with Python v3.11 is about 86% faster than with v2.7.
Based on the above, it should take about 2 hours to process the latest English (en) wikidump (72 GB of decompressed data).
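The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted above; it takes 1 GB as 1000 MB, and the two helper names are made up for illustration.

```python
def hours_to_process(size_gb, speed_mb_per_s):
    """Estimated wall-clock hours, taking 1 GB as 1000 MB."""
    return size_gb * 1000 / speed_mb_per_s / 3600


def percent_faster(new_speed, old_speed):
    """How much faster new_speed is than old_speed, in percent."""
    return (new_speed / old_speed - 1) * 100


# 72 GB of decompressed data at the Python v3.11 speed of 9.7 MB/s
print(round(hours_to_process(72, 9.7), 1))   # 2.1

# Python v3.11 (9.7 MB/s) compared with Python v2.7 (5.2 MB/s)
print(round(percent_faster(9.7, 5.2), 1))    # 86.5
```

Actual run time depends on the hardware, disk throughput and the Python and lxml versions in use.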
(wiki2txt) $ wget -O articles1.xml.bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2 # 254 MB
(wiki2txt) $ bzip2 --decompress articles1.xml.bz2 # 940 MB
(wiki2txt) $ python wiki2txt.py -i articles1.xml -o parsed.xml -r redirects.edg # 400 MB
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 # 19 GB
HINT: add the --continue parameter if you need to resume the download
$ bzip2 --decompress enwiki-latest-pages-articles.xml.bz2 # 72 GB
HINT: add the -k parameter if you want to preserve the original archive
(wiki2txt) $ python wiki2txt.py -i enwiki-latest-pages-articles.xml -o clean-data.xml
(wiki2txt) $ cat enwiki-latest-pages-articles.xml | python wiki2txt.py > clean-data.xml
HINT: redirecting output to a file like this yields slightly faster parsing (9.9 MB/s)
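Since wiki2txt.py accepts XML on STDIN, the 72 GB decompressed file does not strictly have to be written to disk first. The following is a minimal sketch of a helper that decompresses a .bz2 archive in chunks and writes it to standard output. The stream_bz2 function and the bz2cat.py file name used below are hypothetical, and the throughput of this approach has not been measured.

```python
import bz2
import shutil
import sys


def stream_bz2(archive_path, out_stream, chunk_size=1024 * 1024):
    """Decompress archive_path chunk by chunk into out_stream (binary)."""
    with bz2.open(archive_path, "rb") as compressed:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so the whole dump is never held in memory
        shutil.copyfileobj(compressed, out_stream, chunk_size)


if __name__ == "__main__" and len(sys.argv) == 2:
    # sys.stdout.buffer is the raw byte stream behind sys.stdout
    stream_bz2(sys.argv[1], sys.stdout.buffer)
```

Saved as bz2cat.py, it could be combined with the pipe form shown above: python bz2cat.py enwiki-latest-pages-articles.xml.bz2 | python wiki2txt.py > clean-data.xml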