Generates plain or tokenized text files from the Aozora Bunko [English] for use in corpus-based studies.
Primarily for use in an upcoming research project.
WARNING: Currently, the tool requires a checked-out repository of the Aozora Bunko. A git clone will take up to several hours and take up around 14GB of space. Future versions will ease this requirement.
You must install MeCab and UniDic.
On Debian-based distros, the command below should suffice:
sudo apt install -y mecab libmecab-dev unidic-mecab
MacOS users can install the native dependencies with:
brew install mecab mecab-unidic
Python 3 is required. All testing is done on the latest stable version (currently 3.6.2), but a slightly older version should also work. Native dependencies must be installed before installing the Python dependencies (natto-py needs MeCab).
This project uses pipenv. For existing users, the command below should suffice:
pipenv install
pipenv shell
For those using pip
, you can install all the dependencies using the command below:
pip install natto-py jaconv lxml html5_parser
Clone the repository and run:
git clone https://github.com/borh/aozora-corpus-generator.git
cd aozora-corpus-generator
pipenv install
pipenv shell
python aozora-corpus-generator.py --features 'orth' --author-title-csv 'author-title.csv' --out 'Corpora/Japanese' --parallel
You may also use the Pipenv script shortcut to run the program:
pipenv run aozora --features 'orth' --author-title-csv 'author-title.csv' --out 'Corpora/Japanese' --parallel
python aozora-corpus-generator.py --help
usage: aozora-corpus-generator.py [-h] [--features FEATURES [FEATURES ...]]
[--features-opening-delim FEATURES_OPENING_DELIM]
[--features-closing-delim FEATURES_CLOSING_DELIM]
[--author-title-csv AUTHOR_TITLE_CSV [AUTHOR_TITLE_CSV ...]]
[--aozora-bunko-repository AOZORA_BUNKO_REPOSITORY]
--out OUT [--all] [--min-tokens MIN_TOKENS]
[--no-punc] [--incremental] [--parallel]
[--verbose]
aozora-corpus-generator extracts given author and book pairs from Aozora Bunko and formats them into (optionally tokenized) plain text files.
optional arguments:
-h, --help show this help message and exit
--features FEATURES [FEATURES ...]
specify which features should be extracted from
morphemes (default='orth')
--features-opening-delim FEATURES_OPENING_DELIM
specify opening char to use when outputting multiple
features
--features-closing-delim FEATURES_CLOSING_DELIM
specify closing char to use when outputting multiple
features
--author-title-csv AUTHOR_TITLE_CSV [AUTHOR_TITLE_CSV ...]
one or more UTF-8 formatted CSV input file(s)
(default='author-title.csv')
--aozora-bunko-repository AOZORA_BUNKO_REPOSITORY
path to the aozorabunko git repository (default='aozor
abunko/index_pages/list_person_all_extended_utf8.zip')
--out OUT output (plain, tokenized) files into given output
directory (default=Corpora)
--all specify if all Aozora Bunko texts should be extracted,
ignoring the author-title.csv (default=False)
--min-tokens MIN_TOKENS
specify minimum token count to filter files by
(default=30000)
--no-punc specify if punctuation should be discarded from
tokenized version (default=False)
--incremental do not overwrite existing corpus files (default=False)
--parallel specify if processing should be done in parallel
(default=True)
--verbose turns on verbose logging (default=False)
Example usage:
python aozora-corpus-generator.py --features 'orth' --author-title-csv 'author-title.csv' --out 'Corpora/Japanese' --parallel
You may specify multiple values for the --features
and author-title-csv
parameters by putting a space between them like so: --features orth lemma pos1
.
- "Gaiji" characters with provided JIS X 0213 codepoints are converted to their equivalent Unicode codepoint. Aozora Bunko is conservative in encoding rare Kanji, and, therefore, uses images (html version) or textual descriptions (plaintext version).
- Words are sometimes emphasized in Japanese text with dots above characters, while Aozora Bunko uses bold text in their place. Emphasis tags are currently stripped.