Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Deploy i3thuan5/tai5-uan5_gian5-gi2_kang1-ku7 to github.com/i3thuan5/…
…tai5-uan5_gian5-gi2_kang1-ku7.git:gh-pages
0 parents
commit a74477a
Showing 75 changed files with 9,794 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 2ea3c42a137bdd1cf3144faedaa40dc2 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
.. 臺灣言語工具 documentation master file, created by | ||
sphinx-quickstart on Tue Aug 25 08:02:47 2015. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
臺灣言語工具 Documentation | ||
======================================== | ||
|
||
Contents: | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
介紹 | ||
安裝 | ||
基本物件 | ||
常見情境 | ||
機器翻譯 | ||
語音合成 | ||
語音辨識 | ||
|
||
語言模型 | ||
斷詞 | ||
詞性標記 | ||
剖析 | ||
|
||
重音 | ||
變調 | ||
|
||
語言分類 | ||
平行語料語句對齊 | ||
|
||
開發 | ||
授權聲明 | ||
|
||
|
||
|
||
Index | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Introduction | ||
|
||
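## Related projects | ||
* [臺灣言語工具](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_kang1-ku7) | ||
  * Native-language parser, orthography conversion, and related functions. | ||
  * Integration of tools for translation, speech recognition, speech synthesis, and more. | ||
* [臺灣言語資料庫](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_tsu1-liau7-khoo3) | ||
  * Storage conventions for native-language data | ||
* [臺灣言語服務](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_hok8-bu7) | ||
  * A package for `臺灣言語資料庫` | ||
  * Combines `臺灣言語工具` to automate translation, speech synthesis, and other functions | ||
  * Provides web-based services | ||
* [臺灣言語平臺](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_phing5-thai5) | ||
  * A web interface for editing `臺灣言語資料庫` |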
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
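# Parsing | ||
* Input | ||
  * Native-language sentences that have already been word-segmented and part-of-speech tagged | ||
* Output | ||
  * The role of each word in the sentence and the relations between the words |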
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,210 @@ | ||
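# Basic objects | ||
|
||
## Getting started | ||
`字` (character), `詞` (word), `組` (group), `集` (set), `句` (sentence), and `章` (passage) are the basic objects that `臺灣言語工具` operates on. If you are new to the library, start with the `句` object. | ||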
```python3 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> | ||
>>> 拆文分析器.建立句物件('Ta̍k-ke gâu-tsá') # fully romanized (全羅) | ||
句:[集:[組:[詞:[字:Ta̍k , 字:ke ], 詞:[字:gâu , 字:tsá ]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('逐家 gâu早') # mixed Han characters and romanization (漢羅) | ||
句:[集:[組:[詞:[字:逐 , 字:家 ], 詞:[字:gâu , 字:早 ]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('逐家gâu早', 'Ta̍k-ke gâu-tsá') # 漢羅 together with 全羅 | ||
句:[集:[組:[詞:[字:逐 Ta̍k, 字:家 ke], 詞:[字:gâu gâu, 字:早 tsá]]]] | ||
``` | ||
|
||
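## Overview | ||
* `章` | ||
  * Contains multiple `句` | ||
* `句` | ||
  * Contains multiple `集` | ||
* `集` | ||
  * Contains multiple `組`. They are alternative candidates for this part of the sentence: the `集` offers several `組` to choose from, but the sentence uses only one of them | ||
* `組` | ||
  * Contains multiple `詞` | ||
* `詞` | ||
  * Contains multiple `字` | ||
* `字` | ||
  * A `字` holds two variables, `型` (written form) and `音` (pronunciation) | ||
  * For Sinitic languages, `型` stores the Han character and `音` stores the phonetic transcription. If only one of the two is available, it is stored in `型` | ||
  * For Austronesian languages, everything is stored in `型` | ||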
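
The nesting described above can be pictured with a few plain dataclasses. This is only a sketch: the class and field names below are hypothetical stand-ins chosen for illustration and are not the library's real classes, which should be created through `拆文分析器`.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins that mirror the nesting only;
# they are NOT the classes shipped with 臺灣言語工具.
@dataclass
class 字:
    型: str          # written form (Han character or romanization)
    音: str = ''     # pronunciation, empty when unknown

@dataclass
class 詞:
    內底字: List[字] = field(default_factory=list)

@dataclass
class 組:
    內底詞: List[詞] = field(default_factory=list)

@dataclass
class 集:
    內底組: List[組] = field(default_factory=list)  # alternative candidates

@dataclass
class 句:
    內底集: List[集] = field(default_factory=list)

@dataclass
class 章:
    內底句: List[句] = field(default_factory=list)

# Build 章 > 句 > 集 > 組 > 詞 > 字 by hand for one word.
逐家 = 詞([字('逐', 'Ta̍k'), 字('家', 'ke')])
一句 = 句([集([組([逐家])])])
一章 = 章([一句])
```

Each level is just a list of the level below it; only `集` has the extra meaning that its members are alternatives rather than a sequence.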
|
||
|
||
## Creating objects | ||
`拆文分析器` is the main tool for creating the basic objects and handles most kinds of corpus text. Use the function `拆文分析器.建立句物件(語句)`: | ||
```python3 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> | ||
>>> 拆文分析器.建立句物件('臺語工具') # all Han characters | ||
句:[集:[組:[詞:[字:臺 ], 詞:[字:語 ], 詞:[字:工 ], 詞:[字:具 ]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('tai5-gi2 kang1-ku7') # all romanized transcription | ||
句:[集:[組:[詞:[字:tai5 , 字:gi2 ], 詞:[字:kang1 , 字:ku7 ]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('臺語工ku7') # Han characters mixed with transcription | ||
句:[集:[組:[詞:[字:臺 ], 詞:[字:語 ], 詞:[字:工 ], 詞:[字:ku7 ]]]] | ||
>>> | ||
``` | ||
|
||
If you have both the Han characters and a transcription, pass two arguments: `拆文分析器.建立句物件(語句, 羅馬字)` | ||
```python3 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> | ||
>>> 拆文分析器.建立句物件('臺語工具', 'tai5-gi2 kang1-ku7') # every Han character has a matching romanization | ||
句:[集:[組:[詞:[字:臺 tai5, 字:語 gi2], 詞:[字:工 kang1, 字:具 ku7]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('臺語工ku7', 'tai5-gi2 kang1-ku7') # mixed 漢羅 input also works | ||
句:[集:[組:[詞:[字:臺 tai5, 字:語 gi2], 詞:[字:工 kang1, 字:ku7 ku7]]]] | ||
>>> | ||
>>> 拆文分析器.建立句物件('tai5-gi2 kang1-ku7', 'tai5-gi2 kang1-ku7') # romanization paired with romanization also works | ||
句:[集:[組:[詞:[字:tai5 tai5, 字:gi2 gi2], 詞:[字:kang1 kang1, 字:ku7 ku7]]]] | ||
``` | ||
`語句` and `羅馬字` must contain the same number of `字`; otherwise a `解析錯誤` exception is raised. | ||
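
The length requirement can be illustrated with a small standalone check. This is a rough sketch under stated assumptions, not the library's implementation: `count_syllables` and `check_same_length` are hypothetical helpers, the `解析錯誤` class here is a local stand-in for the library's exception, and the counting rule (each Han character is one `字`, each hyphen- or space-separated romanized run is one `字`) is an assumption inferred from the examples above.

```python
import re

class 解析錯誤(Exception):
    """Local stand-in for the library's exception of the same name."""

def count_syllables(text):
    # Assumption: one Han character = one 字; one run of other
    # non-space, non-hyphen characters = one romanized 字.
    return len(re.findall(r'[\u4e00-\u9fff]|[^\u4e00-\u9fff\s-]+', text))

def check_same_length(語句, 羅馬字):
    """Raise 解析錯誤 when the two inputs differ in 字 count."""
    if count_syllables(語句) != count_syllables(羅馬字):
        raise 解析錯誤('語句 and 羅馬字 differ in length')
    return True

check_same_length('臺語工ku7', 'tai5-gi2 kang1-ku7')   # 4 and 4: passes
```

Running the same check before calling `建立句物件` on a large corpus is one way to find misaligned pairs early, assuming the counting rule matches your data.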
|
||
`拆文分析器` also provides the functions `建立章物件`, `建立集物件`, `建立組物件`, `建立詞物件`, and `建立字物件`. See the [unit tests](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_kang1-ku7/blob/master/%E8%A9%A6%E9%A9%97/%E8%A7%A3%E6%9E%90%E6%95%B4%E7%90%86/) for their detailed behavior. | ||
|
||
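### 分詞 strings | ||
Basic objects can be serialized to a `分詞` string, which is convenient for storing in a database and similar processing. Call `物件.看分詞()` to get the `分詞` string, and `拆文分析器.分詞句物件(分詞)` to load it back. | ||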
```python3 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> | ||
>>> 句物件 = 拆文分析器.建立句物件('逐家gâu早', 'Ta̍k-ke gâu-tsá') | ||
>>> 句物件.看分詞() | ||
'逐-家|Ta̍k-ke gâu-早|gâu-tsá' | ||
>>> 拆文分析器.分詞句物件(句物件.看分詞()) == 句物件 | ||
True | ||
``` | ||
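
To make the string layout concrete, here is a minimal standalone reader for the format seen in this document's examples (`型|音` per word, `-` between `字`, a space between `詞`). `parse_hunsu` is a hypothetical helper written for illustration; it ignores edge cases such as a literal `-` or neutral-tone markers, so use `拆文分析器.分詞句物件` for real data.

```python
def parse_hunsu(分詞, 分型音符號='|', 分字符號='-', 分詞符號=' '):
    """Split a 分詞 string into words, each a list of (型, 音) pairs.

    Naive sketch: assumes the separators never occur as literal text.
    """
    words = []
    for token in 分詞.split(分詞符號):
        型, _, 音 = token.partition(分型音符號)
        型字 = 型.split(分字符號)
        # A token without a 音 part gets empty pronunciations.
        音字 = 音.split(分字符號) if 音 else [''] * len(型字)
        words.append(list(zip(型字, 音字)))
    return words

result = parse_hunsu('臺-語|tai5-gi2 工-具|kang1-ku7')
# result == [[('臺', 'tai5'), ('語', 'gi2')], [('工', 'kang1'), ('具', 'ku7')]]
```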
|
||
`拆文分析器` also provides the functions `分詞章物件`, `分詞集物件`, `分詞組物件`, `分詞詞物件`, and `分詞字物件`. See the [unit tests](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_kang1-ku7/blob/master/%E8%A9%A6%E9%A9%97/%E8%A7%A3%E6%9E%90%E6%95%B4%E7%90%86/) for their detailed behavior. | ||
|
||
## Output | ||
Objects can return their `型`, `音`, and `分詞` as needed: | ||
```python3 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> | ||
>>> 漢字 = '臺語工具' | ||
>>> 音標 = 'tai5-gi2 kang1-ku7' | ||
>>> 章物件 = 拆文分析器.對齊章物件(漢字, 音標) | ||
>>> 章物件.看型() | ||
'臺語工具' | ||
>>> 章物件.看音() | ||
'tai5-gi2 kang1-ku7' | ||
>>> 章物件.看分詞() | ||
'臺-語|tai5-gi2 工-具|kang1-ku7' | ||
``` | ||
The same code without prompts, for easy copying: | ||
```python3 | ||
from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
|
||
漢字 = '臺語工具' | ||
音標 = 'tai5-gi2 kang1-ku7' | ||
章物件 = 拆文分析器.對齊章物件(漢字, 音標) | ||
章物件.看型() | ||
章物件.看音() | ||
章物件.看分詞() | ||
``` | ||
A `分詞` string contains both the Han characters and the transcription, and `拆文分析器` can convert it back into an object: | ||
```python3 | ||
from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
|
||
章物件 = 拆文分析器.對齊章物件('臺語工具', 'tai5-gi2 kang1-ku7') | ||
分詞 = 章物件.看分詞() | ||
分詞章物件 = 拆文分析器.分詞章物件(分詞) # 分詞章物件 == 章物件 | ||
``` | ||
|
||
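### Function signatures | ||
`元素.看型(物件分字符號='', 物件分詞符號='', 物件分句符號='')` | ||
|
||
Returns the full `型` of the element. The parameters set the separators between `字`, `詞`, and `句`; by default nothing is inserted. | ||
|
||
`元素.看音(物件分字符號=分字符號, 物件分詞符號=分詞符號, 物件分句符號=分詞符號)` | ||
|
||
Returns the full `音` of the element. The parameters set the separators between `字`, `詞`, and `句`. `分字符號` is `-` and `分詞符號` is a single space. | ||
|
||
`元素.看分詞(物件分型音符號=分型音符號, 物件分字符號=分字符號, 物件分詞符號=分詞符號, 物件分句符號=分詞符號)` | ||
|
||
Returns the `分詞` string of the element, which `拆文分析器` can convert back into an object. The parameters set the separators between `字`, `詞`, and `句`. `分字符號` is `-` and `分詞符號` is a single space. | ||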
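
The effect of the separator parameters can be sketched with two free functions over a plain list structure. These are hypothetical illustrations that reuse the documented parameter names and defaults; they are not the methods of the library's objects, and they leave out `物件分句符號` for brevity.

```python
分字符號 = '-'   # default separator between 字 inside a 詞
分詞符號 = ' '   # default separator between 詞

# One sentence as a list of words; each word is a list of (型, 音) pairs.
句 = [[('臺', 'tai5'), ('語', 'gi2')], [('工', 'kang1'), ('具', 'ku7')]]

def 看型(句, 物件分字符號='', 物件分詞符號=''):
    """Join every 型, with no separators by default."""
    return 物件分詞符號.join(
        物件分字符號.join(型 for 型, _ in 詞) for 詞 in 句)

def 看音(句, 物件分字符號=分字符號, 物件分詞符號=分詞符號):
    """Join every 音, with '-' inside words and ' ' between words."""
    return 物件分詞符號.join(
        物件分字符號.join(音 for _, 音 in 詞) for 詞 in 句)

看型(句)                     # '臺語工具'
看音(句)                     # 'tai5-gi2 kang1-ku7'
看型(句, 物件分詞符號=' ')    # '臺語 工具'
```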
|
||
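## Processing | ||
Methods on the basic objects always return new objects; the original object is not modified. | ||
|
||
### Word segmentation | ||
Some Sinitic corpora mix Han characters and romanization. To make the corpus more consistent, there are two approaches to word segmentation. | ||
|
||
#### Two-step segmentation | ||
First use a dictionary to find the candidate break points, then pick one of the candidates as the result. | ||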
```python3 | ||
a='11' | ||
``` | ||
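
The two steps can be sketched in plain Python. This is an illustration only: the standalone functions below borrow the names `揣詞` and `揀` from the method signatures in this section, but their bodies, the toy dictionary, and the "fewest words" selection rule are all assumptions rather than the library's behavior.

```python
def 揣詞(text, 辭典):
    """Step 1: list every way to cut `text` into dictionary words.

    Single characters are always allowed as a fallback.
    """
    if not text:
        return [[]]
    candidates = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if end == 1 or head in 辭典:
            for rest in 揣詞(text[end:], 辭典):
                candidates.append([head] + rest)
    return candidates

def 揀(candidates):
    """Step 2: pick one candidate; here, the one with the fewest words."""
    return min(candidates, key=len)

辭典 = {'臺語', '工具'}          # toy dictionary (assumption)
candidates = 揣詞('臺語工具', 辭典)
best = 揀(candidates)            # ['臺語', '工具']
```

Separating the two steps means the candidate set (what a `集` holds as alternative `組`) can be stored and a different selection rule applied later.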
|
||
##### Function signatures | ||
```python3 | ||
def 揣詞(self, 揣詞方法, *參數陣列, **參數物件): ...  # signature only | ||
def 揀(self, 揀集內組方法, *參數陣列, **參數物件): ...  # signature only | ||
``` | ||
|
||
#### Direct segmentation | ||
This also uses a dictionary and a language model. `辭典語言模型斷詞` tries every combination, but it is slower. | ||
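
Why trying every combination is slow can be seen in a brute-force sketch: a text of n characters has 2^(n-1) possible segmentations. The code below is a toy illustration with a hypothetical unigram model and made-up probabilities; it is not how `辭典語言模型斷詞` is implemented.

```python
import math
from itertools import combinations

def all_segmentations(text):
    """Yield every way to cut `text` into contiguous pieces."""
    n = len(text)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

def score(words, unigram, unknown=1e-6):
    """Toy unigram language model: sum of log-probabilities."""
    return sum(math.log(unigram.get(w, unknown)) for w in words)

def segment(text, unigram):
    """Score every segmentation and keep the best one."""
    return max(all_segmentations(text), key=lambda ws: score(ws, unigram))

# Made-up probabilities, for illustration only.
unigram = {'臺語': 0.2, '工具': 0.2,
           '臺': 0.05, '語': 0.05, '工': 0.05, '具': 0.05}
best = segment('臺語工具', unigram)   # ['臺語', '工具']
```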
|
||
#### Mandarin segmentation | ||
Uses the Academia Sinica (中研院) word segmenter. | ||
|
||
##### Function signatures | ||
```python3 | ||
def 斷詞(self, 斷詞方法, *參數陣列, **參數物件): ...  # signature only | ||
``` | ||
|
||
### Translation | ||
```python3 | ||
def 翻譯(self, 翻譯方法, *參數陣列, **參數物件): ...  # signature only | ||
``` | ||
|
||
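### Other processing | ||
To keep the interface clean, rarely used or special-purpose features are not wrapped in their own methods. For example, to call `閩南語變調.變調(cls, 物件)`, use the `做` method: | ||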
```python3 | ||
物件.做(閩南語變調, '變調') | ||
``` | ||
For `集內組照排.排(cls, 排法, 物件)`, which takes an extra argument, use: | ||
```python3 | ||
物件.做(集內組照排, '排', 排法=lambda 物件:str(物件)) | ||
``` | ||
|
||
#### Function signatures | ||
```python3 | ||
def 做(self, 模組, 函式名, *參數陣列, **參數物件): ...  # signature only | ||
``` | ||
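
One plausible way such a dispatcher could work is to look the function up by name and pass the object along. The sketch below is a guess for illustration: `物件示範` and `轉大寫` are hypothetical classes, and passing the object as the keyword argument `物件` is an assumption based on the signatures shown above, not a confirmed detail of the library.

```python
class 物件示範:
    """Hypothetical object, only to illustrate how `做` could dispatch."""

    def __init__(self, 內容):
        self.內容 = 內容

    def 做(self, 模組, 函式名, *參數陣列, **參數物件):
        # Look the function up by name, then hand over this object
        # as the keyword argument `物件` (assumption).
        函式 = getattr(模組, 函式名)
        return 函式(*參數陣列, 物件=self, **參數物件)

class 轉大寫:
    """Hypothetical processing class with a classmethod taking `物件`."""

    @classmethod
    def 轉(cls, 物件):
        # Return a new object; the original is left untouched.
        return 物件示範(物件.內容.upper())

原本 = 物件示範('tai5-gi2')
新 = 原本.做(轉大寫, '轉')
```

Here `新.內容` is `'TAI5-GI2'` while `原本.內容` is unchanged, matching the rule that processing returns new objects.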
|
||
|
||
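### Preprocessing (deprecated) | ||
Because `-` carries word-segmentation meaning in Southern Min (閩南語) text, `拆文分析器` is strict about how `-` is used. Preprocessing sentences with `文章粗胚` before analysis is recommended. | ||
|
||
#### Sinitic languages | ||
For example, Southern Min marks the neutral tone with `--`; `文章粗胚` first converts it to `-0`. | ||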
```python3 | ||
>>> from 臺灣言語工具.解析整理.文章粗胚 import 文章粗胚 | ||
>>> from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
>>> from 臺灣言語工具.音標系統.閩南語.臺灣閩南語羅馬字拼音 import 臺灣閩南語羅馬字拼音 | ||
>>> | ||
>>> 原來語句 = '人莫走boo5--ki3。' | ||
>>> 處理好語句 = 文章粗胚.建立物件語句前處理減號(臺灣閩南語羅馬字拼音, 原來語句) # convert the neutral-tone -- to -0 | ||
# 處理好語句 == '人莫走boo5-0ki3。' | ||
>>> 加空白後語句 = 文章粗胚.符號邊仔加空白(處理好語句) # add spaces around the full stop | ||
# 加空白後語句 == '人莫走boo5-0ki3 。 ' | ||
>>> 拆文分析器.建立章物件(加空白後語句) | ||
章:[句:[集:[組:[詞:[字:人 ], 詞:[字:莫 ], 詞:[字:走 ], 詞:[字:boo5 , 字:0ki3 ], 詞:[字:。 ]]]]] | ||
``` | ||
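
The two string transformations in the session above can be approximated with the standard library alone. This is a simplified sketch, not the `文章粗胚` implementation: `輕聲轉做零` and `加空白` are hypothetical helpers, the punctuation set is an assumption, and the real preprocessing also consults the romanization system passed to it.

```python
import re

def 輕聲轉做零(語句):
    """Replace the neutral-tone marker `--` with `-0`."""
    return 語句.replace('--', '-0')

def 加空白(語句, 符號='。,、!?'):
    """Put a space on both sides of each listed punctuation mark."""
    return re.sub('([' + re.escape(符號) + '])', r' \1 ', 語句)

原來語句 = '人莫走boo5--ki3。'
處理好語句 = 輕聲轉做零(原來語句)     # '人莫走boo5-0ki3。'
加空白後語句 = 加空白(處理好語句)     # '人莫走boo5-0ki3 。 '
```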
|
||
#### Austronesian languages | ||
`拆文分析器` cannot yet handle `'` correctly, but the workflow is roughly as follows: | ||
```python3 | ||
from 臺灣言語工具.解析整理.文章粗胚 import 文章粗胚 | ||
from 臺灣言語工具.解析整理.拆文分析器 import 拆文分析器 | ||
from 臺灣言語工具.音標系統.閩南語.臺灣閩南語羅馬字拼音 import 臺灣閩南語羅馬字拼音 | ||
|
||
原來語句 = "Nga'ay ho?" | ||
處理好語句 = 文章粗胚.建立物件語句前減號變標點符號(原來語句) | ||
加空白後語句 = 文章粗胚.符號邊仔加空白(處理好語句) | ||
拆文分析器.建立章物件(加空白後語句) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# Installation | ||
|
||
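[臺灣言語工具](https://github.com/sih4sing5hong5/tai5-uan5_gian5-gi2_kang1-ku7) aims to spare native-language tools from being developed over and over. It mainly provides two kinds of functionality: | ||
* Native-language parser, orthography conversion, and related functions. | ||
* Integration of tools for translation, speech recognition, speech synthesis, and more. | ||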
|
||
## Docker | ||
- Install [docker](https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/) | ||
- Install [docker-compose](https://docs.docker.com/compose/install/) | ||
- Set Docker permissions: `sudo usermod -aG docker $USER` | ||
``` | ||
docker run -ti --rm i3thuan5/tai5-uan5_gian5-gi2_kang1-ku7 | ||
``` | ||
|
||
## Quick install | ||
```bash | ||
sudo apt-get install -y python3 virtualenv g++ python3-dev zlib1g-dev libbz2-dev liblzma-dev libboost-all-dev # install command for Ubuntu/Mint | ||
virtualenv --python=python3 venv; . venv/bin/activate; pip install --upgrade pip # set up the virtual environment | ||
pip install tai5-uan5_gian5-gi2_kang1-ku7 # install 臺灣言語工具 | ||
``` | ||
|
||
## Detailed installation | ||
|
||
### Operating system | ||
[Mint Linux](http://www.linuxmint.com/download.php) and [Ubuntu Linux](http://www.ubuntu-tw.org/modules/tinyd0/) are recommended. | ||
Other Linux distributions or macOS also work, | ||
but you will need to adapt the commands yourself. | ||
|
||
### Virtual environment setup | ||
First install python3 and [virtualenv](https://virtualenv.readthedocs.org/en/latest/): | ||
```bash | ||
sudo apt-get install -y python3 virtualenv g++ python3-dev zlib1g-dev libbz2-dev liblzma-dev libboost-all-dev # install command for Ubuntu/Mint | ||
virtualenv --python=python3 venv; . venv/bin/activate; pip install --upgrade pip # set up the virtual environment | ||
``` | ||
See also this [virtualenv](http://www.openfoundry.org/tw/tech-column/8516-pythons-virtual-environment-and-multi-version-programming-tools-virtualenv-and-pythonbrew) usage guide. | ||
|
||
Activate the environment before each use: | ||
```bash | ||
. venv/bin/activate # load the environment | ||
``` | ||
|
||
### Install the PyPI release | ||
```bash | ||
pip install tai5-uan5_gian5-gi2_kang1-ku7 | ||
``` | ||
|
||
### Uninstall the package | ||
```bash | ||
pip uninstall tai5-uan5_gian5-gi2_kang1-ku7 | ||
``` | ||
|
||
## Related packages | ||
|
||
### [htsengine](https://github.com/sih4sing5hong5/hts_engine_python) | ||
Speech synthesis tool. It is already installed along with `pip install tai5-uan5_gian5-gi2_kang1-ku7`. | ||
```bash | ||
pip install htsengine | ||
``` | ||
|
||
### [Kenlm](https://github.com/kpu/kenlm) | ||
Language model library | ||
```bash | ||
sudo apt-get install -y g++ libboost-all-dev # for Ubuntu 14.04+ / Mint 17+ | ||
pip install pypi-kenlm # the release packaged on PyPI | ||
pip install https://github.com/kpu/kenlm/archive/master.zip # the latest version | ||
``` | ||
|
||
### [bleualign](https://github.com/rsennrich/Bleualign) | ||
Parallel sentence alignment library | ||
```bash | ||
pip install pypi-bleualign # the release packaged on PyPI | ||
pip install https://github.com/rsennrich/Bleualign/archive/master.zip # the latest version | ||
``` |