Releases: hyunwoongko/pecab
v1.0.8
v1.0.7
Improve drop_space
.
v1.0.4
v1.0.3
Apply LRU cache to _tokenize
method to reduce elapse time for same inputs
v1.0.2
v1.0.0
Pecab
Pecab is pure python Korean morpheme analyzer based on Mecab. Mecab is a CRF-based morpheme analyzer made by Taku Kudo at 2011. It is very fast and accurate at the same time, which is why it is still very popular even though it is quite old. However, it is known to be one of the most tricky libraries to install, and in fact many people have had a hard time installing Mecab.
So, since a few years ago, I wanted to make a pure python version of Mecab that was easy to install while inheriting the advantages of Mecab.
Now, Pecab came out. This ensures results very similar to Mecab and at the same time easy to install. For more details, please refer the following.
Installation
pip install pecab
Usages
The user API of Pecab is inspired by KoNLPy,
a one of the most famous natural language processing package in South Korea.
1) PeCab()
: creating Pecab object.
from pecab import PeCab
pecab = PeCab()
2) morphs(text)
: splits text into morphemes.
pecab.morphs("아버지가방에들어가시다")
['아버지', '가', '방', '에', '들어가', '시', '다']
3) pos(text)
: returns morphemes and POS tags together.
pecab.pos("이것은 문장입니다.")
[('이것', 'NP'), ('은', 'JX'), ('문장', 'NNG'), ('입니다', 'VCP+EF'), ('.', 'SF')]
4) nouns(text)
: returns all nouns in the input text.
pecab.nouns("자장면을 먹을까? 짬뽕을 먹을까? 그것이 고민이로다.")
["자장면", "짬뽕", "그것", "고민"]
5) Pecab(user_dict=List[str])
: Set up a user dictionary.
Note that words included in the user dictionary cannot contain spaces.
- Without
user_dict
from pecab import PeCab
pecab = PeCab()
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성', 'NNP'), ('디지털', 'NNP'), ('프라자', 'NNP'), ('에서', 'JKB'), ('지', 'NNP'), ('펠', 'NNP'), ('냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
- With
user_dict
from pecab import PeCab
user_dict = ["삼성디지털프라자", "지펠냉장고"]
pecab = PeCab(user_dict=user_dict)
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성디지털프라자', 'NNG'), ('에서', 'JKB'), ('지펠냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
6) PeCab(split_compound=bool)
: Divide compound words into smaller pieces.
from pecab import PeCab
pecab = PeCab(split_compound=True)
pecab.morphs("가벼운 냉장고를 샀어요.")
['가볍', 'ᆫ', '냉장', '고', '를', '사', 'ㅏㅆ', '어요', '.']
7) ANY_PECAB_FUNCTION(text, drop_space=bool)
: Determines whether spaces are returned or not.
This can be used for all of morphs
, pos
, nouns
. default value of this is True
.
from pecab import PeCab
pecab = PeCab()
pecab.pos("토끼정에서 크림 우동을 시켰어요.")
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), ('크림', 'NNG'), ('우동', 'NNG'), ('을', 'JKO'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
pecab.pos("토끼정에서 크림 우동을 시켰어요.", drop_space=False)
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), (' ', 'SP'), ('크림', 'NNG'), (' ', 'SP'), ('우동', 'NNG'), ('을', 'JKO'), (' ', 'SP'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]