Hilofumi Yamamoto, Ph.D. (Tokyo Institute of Technology)
Bor Hodošček, D.Engineering (Osaka University)
The Hachidaishu database, encoded in TEI format, is in the hachidaishu.xml file.
It was generated using the Hachidaishu_WLSP_TEI_Conversion.ipynb notebook contained in this repository.
The notebook uses (mostly) publicly available resources to create the TEI encoding, in conjunction with the two Python scripts dictionaryconverter.py and hachidaishu.py in this repo.
dictionaryconverter.py defines the IPAdic-to-UniDic and UniDic-to-UD POS mappings, while hachidaishu.py is a helper library for reading the hachidai.db database format.
A simplified version of the TEI encoding is available in a flat JSON line-delimited format.
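As a minimal sketch of reading a line-delimited JSON file in Python: each non-empty line is parsed as one JSON object. The field names in the sample records below ("surface", "lemma") are hypothetical placeholders, not the dataset's actual keys, and an in-memory buffer stands in for the real file.

```python
import io
import json


def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a text stream."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical records: the keys are placeholders for illustration only.
sample = io.StringIO(
    '{"surface": "年", "lemma": "年"}\n'
    '{"surface": "の", "lemma": "の"}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["surface"])  # prints: 2 年
```

For a file on disk, the same function can be called on a handle opened with `open(path, encoding="utf-8")`.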
Example: #1 Kokinshu
01:000001:0001 A00 BG-01-1630-01-0100 02 年 年 とし 年 とし
01:000001:0001 A10 BG-01-1911-03-1800 02 年 年 とし 年 とし
01:000001:0002 A00 BG-08-0061-07-0100 61 の の の の の
01:000001:0003 A00 BG-01-1770-01-0300 02 内 内 うち 内 うち
01:000001:0004 A00 BG-08-0061-05-0100 61 に に に に に
01:000001:0005 A00 BG-01-1624-02-0100 02 春 春 はる 春 はる
01:000001:0006 A00 BG-08-0065-07-0100 65 は は は は は
01:000001:0007 A00 BG-02-1527-01-0102 47 き 来 く 来 き
01:000001:0008 A00 BG-03-1200-02-0900 74 に ぬ ぬ に に
01:000001:0008 A10 BG-09-0010-01-0101 74 に ぬ ぬ に に
01:000001:0008 A20 BG-09-0010-03-0200 74 に ぬ ぬ に に
01:000001:0009 A00 BG-09-0010-04-0300 74 けり けり けり けり けり
01:000001:0010 B00 BG-01-1950-14-0100 02 一とせ 一年 ひととせ 一年 ひととせ
01:000001:0010 C00 BG-01-1950-01-0300 19 一 一 いち 一 いち
01:000001:0010 C01 BG-01-1630-01-0100 02 年 年 とし 年 とし
01:000001:0011 A00 BG-08-0061-10-0100 61 を を を を を
01:000001:0012 A00 BG-01-1642-02-0100 02 こそ 去年 こぞ 去年 こぞ
01:000001:0013 A00 BG-08-0061-04-0100 61 と と と と と
01:000001:0014 A00 BG-08-0065-14-0100 65 や や や や や
01:000001:0015 A00 BG-02-3120-01-0100 47 いは 言ふ いふ 言は いは
01:000001:0016 A00 BG-03-3012-03-2600 74 ん む む む む
01:000001:0016 A10 BG-09-0010-02-0102 74 ん む む む む
01:000001:0017 B00 BG-01-1641-02-0100 02 ことし 今年 ことし 今年 ことし
01:000001:0017 C00 BG-03-1000-01-0100 57 この この この この この
01:000001:0017 C01 BG-01-1630-01-0100 02 年 年 とし 年 とし
01:000001:0018 A00 BG-08-0061-04-0100 61 と と と と と
01:000001:0019 A00 BG-08-0065-14-0100 65 や や や や や
01:000001:0020 A00 BG-02-3120-01-0100 47 いは 言ふ いふ 言は いは
01:000001:0021 A00 BG-03-3012-03-2600 74 ん む む む む
01:000001:0021 A10 BG-09-0010-02-0102 74 ん む む む む
A line consists of 9 columns separated by spaces.
01:000001:0007 A00 BG-02-1527-01-0102 47 き 来 く 来 き
- 1st column "01:000001:0007" consists of 3 colon-separated fields: 1) anthology ID, 2) poem number, and 3) serial ID of the token. The anthology IDs are: 01 Kokinshu, 02 Gosenshu, 03 Shuishu, 04 Goshuishu, 05 Kin'yoshu, 06 Shikashu, 07 Senzaishu, and 08 Shinkokinshu.
- 2nd column indicates the type of token: type A is a single token; type B is a compound token; type C is a breakdown of a type B token. A00 indicates a single token; A01 indicates a single token that has another meaning; B00 indicates a compound token; B01 indicates a compound token that has another meaning; C00 indicates the first element of the B00/B01.. breakdown; C01 indicates the second element of the B00/B01.. breakdown.
- 3rd column "BG-02-1527-01-0102": classification ID based on semantic categories according to Bunruigoihyo (Yamazaki et al. 2014).
- 4th column indicates the ChaSen POS number.
- 5th column indicates the surface form, i.e. the form as it appears in the literary work.
- 6th column indicates the lemma in kanji writing.
- 7th column indicates the lemma in kana writing.
- 8th column indicates the conjugated form in kanji writing.
- 9th column indicates the conjugated form in kana writing.
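The column layout above can be sketched as a small parser. This is an illustrative example only and is not taken from hachidaishu.py; the `Token` class, its field names, and `parse_line` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass

# Anthology IDs as listed in the description of the 1st column.
ANTHOLOGIES = {
    "01": "Kokinshu", "02": "Gosenshu", "03": "Shuishu", "04": "Goshuishu",
    "05": "Kin'yoshu", "06": "Shikashu", "07": "Senzaishu", "08": "Shinkokinshu",
}


@dataclass(frozen=True)
class Token:
    """One row of the space-separated format (field names are hypothetical)."""
    anthology: str
    poem: int
    serial: int
    token_type: str
    bg_id: str
    pos: str
    surface: str
    lemma_kanji: str
    lemma_kana: str
    form_kanji: str
    form_kana: str


def parse_line(line: str) -> Token:
    """Split a line into its 9 columns and the 1st column into its 3 fields."""
    cols = line.split()
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
    token_id, token_type, bg_id, pos, *forms = cols
    anthology_id, poem, serial = token_id.split(":")
    return Token(ANTHOLOGIES[anthology_id], int(poem), int(serial),
                 token_type, bg_id, pos, *forms)


token = parse_line("01:000001:0007 A00 BG-02-1527-01-0102 47 き 来 く 来 き")
print(token.anthology, token.poem, token.serial, token.surface, token.lemma_kanji)
# prints: Kokinshu 1 7 き 来
```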
Please see the notebook provided in this repository for some examples on loading and analysing the dataset from Python. Note that a newer version of the code was refactored into hachidaishu.py.
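As one small illustration of working with the space-separated format directly (independent of the notebook and of hachidaishu.py), the sketch below rebuilds the surface text of a poem fragment. It assumes, based on the example rows above, that rows giving an additional meaning repeat the serial ID of the token, and that type C rows only break down the preceding type B token; the function name `surface_text` is hypothetical.

```python
SAMPLE = """\
01:000001:0007 A00 BG-02-1527-01-0102 47 き 来 く 来 き
01:000001:0008 A00 BG-03-1200-02-0900 74 に ぬ ぬ に に
01:000001:0008 A10 BG-09-0010-01-0101 74 に ぬ ぬ に に
01:000001:0008 A20 BG-09-0010-03-0200 74 に ぬ ぬ に に
01:000001:0009 A00 BG-09-0010-04-0300 74 けり けり けり けり けり
01:000001:0010 B00 BG-01-1950-14-0100 02 一とせ 一年 ひととせ 一年 ひととせ
01:000001:0010 C00 BG-01-1950-01-0300 19 一 一 いち 一 いち
01:000001:0010 C01 BG-01-1630-01-0100 02 年 年 とし 年 とし
"""


def surface_text(lines):
    """Join the surface forms, counting each token ID once.

    Type C rows (breakdowns of a compound) are skipped, and only the first
    type A or B row for each token ID is used (assumption stated above).
    """
    seen = set()
    parts = []
    for line in lines:
        cols = line.split()
        if len(cols) != 9:
            continue  # ignore blank or malformed lines
        token_id, token_type, surface = cols[0], cols[1], cols[4]
        if token_type.startswith("C") or token_id in seen:
            continue
        seen.add(token_id)
        parts.append(surface)
    return "".join(parts)


fragment = surface_text(SAMPLE.splitlines())
print(fragment)  # prints: きにけり一とせ
```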
- Yamamoto, Hilofumi (2007) Thesaurus of Japanese Poetic Vocabulary Based on the Semantic Classifications Chart, The 13th Annual Symposium for Database of the Humanities, 1-8, The Association for Database of the Humanities, Osaka.
- Yamamoto, Hilofumi (2009) Thesaurus for the Hachidaishu (ca. 905-1205) with the classification codes based on semantic principles, Nihongo no Kenkyu / Studies in the Japanese Language, 46-52, Society for Japanese Linguistics, 5, 1, ISSN 1349-5119.
- Yamamoto, Hilofumi (2021) Hachidaishu vocabulary dataset, Zenodo, version 1.0.1, https://doi.org/10.5281/zenodo.4744170
- Yamazaki, Makoto and Kashino, Wakako and Uchiyama, Kiyoko and Sunaoka, Kazuko and Tajima, Ikudo and Yamamoto, Hilofumi and Han, Yoo-Sik and Seol, Geun-Su (2014) "Bunruigoihyo zouhokaiteiban" e no anoteishion: kihongi no kettei (in Japanese), Keiryo Kokugo gakkai dai 58 kai taikai yokoshu, pp. 7-12.
- Nijuichidaishu (二十一代集), DOI: 10.20730/200007092, ROIS-DS Center for Open Data in the Humanities, Database of Pre-Modern Japanese Works (200007093).