myTokenize is a Python library that tokenizes Myanmar text into syllables, words, phrases, and sentences. It supports multiple tokenization techniques using rule-based, statistical, and neural network-based approaches.
- Syllable Tokenization: Break text into syllables using regex rules.
- BPE and Unigram Tokenization: Leverage SentencePiece models for tokenization.
- Word Tokenization: Segment text into words using one of three engines:
  - myWord: Dictionary-based tokenization.
  - CRF: Conditional Random Fields-based tokenization.
  - BiLSTM: Neural network-based tokenization.
- Phrase Tokenization: Identify phrases in text using normalized pointwise mutual information (NPMI).
- Sentence Tokenization: Use a BiLSTM model to segment text into sentences.
To install myTokenize:

- Clone the repository:

  ```bash
  git clone https://github.com/ThuraAung1601/myTokenize.git
  cd myTokenize
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the library:

  ```bash
  pip install .
  ```
Example usage for each tokenizer follows.

Syllable tokenization:

```python
from myTokenize import SyllableTokenizer

tokenizer = SyllableTokenizer()
syllables = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(syllables)  # ['မြန်', 'မာ', 'နိုင်', 'ငံ', '။']
```
BPE tokenization (via SentencePiece):

```python
from myTokenize import BPETokenizer

tokenizer = BPETokenizer()
tokens = tokenizer.tokenize("ရွေးကောက်ပွဲမှာနိုင်ထားတဲ့ဒေါ်နယ်ထရမ့်")
print(tokens)  # ['▁ရွေးကောက်ပွဲ', 'မှာ', 'နိုင်', 'ထား', 'တဲ့', 'ဒေါ်', 'နယ်', 'ထ', 'ရ', 'မ့်']
```
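The feature list also mentions Unigram tokenization. Assuming the library exposes a UnigramTokenizer counterpart to BPETokenizer (the bundled unigram_sentencepiece_model files suggest one), usage would mirror the BPE example:

```python
# Assumption: UnigramTokenizer mirrors the BPETokenizer API.
from myTokenize import UnigramTokenizer

tokenizer = UnigramTokenizer()
tokens = tokenizer.tokenize("ရွေးကောက်ပွဲမှာနိုင်ထားတဲ့ဒေါ်နယ်ထရမ့်")
print(tokens)  # exact pieces depend on the unigram model
```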
Word tokenization:

```python
from myTokenize import WordTokenizer

tokenizer = WordTokenizer(engine="CRF")  # Use "myWord", "CRF", or "LSTM"
words = tokenizer.tokenize("မြန်မာနိုင်ငံ။")
print(words)  # ['မြန်မာ', 'နိုင်ငံ', '။']
```
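The engine argument selects among the three word-segmentation approaches listed in the features above. A minimal sketch switching to the dictionary-based myWord engine (its output may differ from the CRF engine's):

```python
from myTokenize import WordTokenizer

# Dictionary-based myWord engine; pass engine="LSTM" for the BiLSTM model.
tokenizer = WordTokenizer(engine="myWord")
print(tokenizer.tokenize("မြန်မာနိုင်ငံ။"))
```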
Phrase tokenization:

```python
from myTokenize import PhraseTokenizer

tokenizer = PhraseTokenizer()
phrases = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါ")
print(phrases)  # ['ညာဘက်_ကို', 'ယူ', 'ပြီး_တော့', 'တည့်တည့်', 'သွား_ပါ']
```
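For reference, NPMI for an adjacent pair (x, y) follows the standard definition NPMI(x, y) = ln(p(x, y) / (p(x) · p(y))) / (−ln p(x, y)), which is 1 when x and y always co-occur and 0 when they are independent. High-scoring pairs are the ones joined with "_" in the output above; the exact threshold and merging scheme are internal to PhraseTokenizer.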
Sentence tokenization (each sentence is returned as its own list of word tokens):

```python
from myTokenize import SentenceTokenizer

tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize("ညာဘက်ကိုယူပြီးတော့တည့်တည့်သွားပါခင်ဗျားငါးမိနစ်လောက်ကြာလိမ့်မယ်")
print(sentences)  # [['ညာ', 'ဘက်', 'ကို', 'ယူ', 'ပြီး', 'တော့', 'တည့်တည့်', 'သွား', 'ပါ'], ['ခင်ဗျား', 'ငါး', 'မိနစ်', 'လောက်', 'ကြာ', 'လိမ့်', 'မယ်']]
```
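Since SentenceTokenizer returns one token list per sentence, plain-text sentences can be recovered by joining each list (a minimal sketch; Burmese script is written without spaces between words, so tokens are concatenated directly):

```python
# Rebuild plain-text sentences from the nested token lists above.
plain = ["".join(tokens) for tokens in sentences]
print(plain)
```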
Repository layout:

```
./myTokenize/
├── CRFTokenizer
│   └── wordseg_c2_crf.crfsuite
├── SentencePiece
│   ├── bpe_sentencepiece_model.model
│   ├── bpe_sentencepiece_model.vocab
│   ├── unigram_sentencepiece_model.model
│   └── unigram_sentencepiece_model.vocab
├── Tokenizer.py
└── myWord
    ├── phrase_segment.py
    └── word_segment.py
```
Requirements:

- Python 3.8+
- TensorFlow
- SentencePiece
- pycrfsuite
- NumPy
This project is licensed under the MIT License. See the LICENSE file for details.
Contributors:

- Ye Kyaw Thu
- Thura Aung
References:

- myWord: Syllable, Word and Phrase Segmenter for Burmese, Ye Kyaw Thu, Sept 2021. GitHub: https://github.com/ye-kyaw-thu/myWord
- sylbreak: Syllable segmentation tool for Myanmar language (Burmese), Ye Kyaw Thu. GitHub: https://github.com/ye-kyaw-thu/sylbreak
- mySentence: Corpus and Models for Burmese (Myanmar language) Sentence Tokenization. GitHub: https://github.com/ye-kyaw-thu/mySentence