spaCy (version 2+) models for the Chinese language. These models are rough and still a work in progress, but "something is better than nothing".
An online Jupyter notebook demo is provided at .
Partial attributes of a Doc object for 王小明在北京的清华大学读书:
NER of a Doc object for 王小明在北京的清华大学读书:
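The attributes and entities mentioned above can be inspected with a few lines of Python. The following is only a minimal sketch: the helper names `token_rows`, `entity_rows`, and `load_model` are hypothetical and not part of this repository, and the model name `zh_core_web_sm` is assumed from the release archive's file name. The token and entity attributes themselves (`pos_`, `shape_`, `is_oov`, `label_`, and so on) are standard spaCy 2 attributes.

```python
# Hypothetical helpers for illustration; they are not part of this repository.
TOKEN_ATTRS = ("text", "pos_", "tag_", "dep_", "shape_", "is_alpha", "is_stop", "is_oov")


def token_rows(tokens):
    """Return one dict per token holding the attributes listed in TOKEN_ATTRS."""
    return [{name: getattr(token, name) for name in TOKEN_ATTRS} for token in tokens]


def entity_rows(doc):
    """Return (text, start_char, end_char, label) for each entity in doc.ents."""
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]


def load_model(name="zh_core_web_sm"):
    """Load the installed model (name assumed from the release archive)."""
    # spaCy is imported lazily so the helpers above can be used without it.
    import spacy
    return spacy.load(name)


# Typical use, once spaCy 2+ and the model are installed:
#   nlp = load_model()
#   doc = nlp("王小明在北京的清华大学读书")
#   for row in token_rows(doc):
#       print(row)
#   for row in entity_rows(doc):
#       print(row)
```

Because iterating over a spaCy `Doc` yields its tokens, a `Doc` can be passed straight to `token_rows`. Some of the printed values may be unreliable; see the known issues listed further down.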
Models are released as binary files; users should have basic knowledge of spaCy version 2+.
Python 3 (Python 2 may work, but is currently not well tested)
Download the released model from releases.
wget -c https://github.com/howl-anderson/Chinese_models_for_SpaCy/releases/download/v2.0.4/zh_core_web_sm-2.0.4.tar.gz
Then install the model:
pip install zh_core_web_sm-2.0.4.tar.gz
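To confirm that the install step worked, one option is to check that the model is importable as a Python package. This is a minimal sketch that assumes the installed package is named `zh_core_web_sm`, matching the archive file name; the helper `is_model_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_model_installed(package_name="zh_core_web_sm"):
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no such top-level package is found.
    return importlib.util.find_spec(package_name) is not None


if is_model_installed():
    print("zh_core_web_sm is installed")
else:
    print("zh_core_web_sm not found; run the pip install step again")
```

The check only looks the package up and does not load the model, so it stays fast even for large models.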
test.py contains demo code. After installing the model, download or clone this repo, then execute:
python3 ./test.py
Then open a web browser at http://127.0.0.1:5000 and you will see an image similar to this:
See workflow
The corpus data used in this project is OntoNotes 5.0.
Because OntoNotes 5.0 is copyrighted material of the LDC (Linguistic Data Consortium), this project cannot include the data directly. The good news is that OntoNotes 5.0 is free for organizational users: you can set up an account for your company or school and then obtain OntoNotes 5.0 at no cost.
- Attribute pos_ is not working correctly. This is related to the Language class in spaCy.
- Attributes shape_ and is_alpha seem meaningless for Chinese; this needs to be verified.
- Attribute is_stop is not working correctly. This is related to the Language class in spaCy.
- Attribute vector seems not well trained.
- Attribute is_oov is totally incorrect. First priority.
- NER model is not available due to lack of the LDC corpus. I am working on it.
- Release all the intermediate material to help users build their own models.
- TODO
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
We use SemVer for versioning. For the versions available, see the tags on this repository.
- Xiaoquan Kong - Initial work - howl-anderson
See also the list of contributors who participated in this project.
This project is licensed under the MIT License; see the LICENSE.md file for details.
- TODO