Skip to content

Latest commit

 

History

History
41 lines (36 loc) · 1.78 KB

README.md

File metadata and controls

41 lines (36 loc) · 1.78 KB

Introduction

Code for Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.

Usage

N-gram and trained BERT classifier cannot be public since privacy policy.

Use in command lines

python -m graces -s 饮食可睡眠可大便不规律小便正常体重无明显减轻python -m graces -f ./input.txt -o ./output.txt

Import from python

import graces
graces.cut("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。") # Segment a single sentence
graces.cut_k("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。", k=8) # Segment a single sentence with fixed word count k.
graces.cut_file("./input.txt", "./output.txt") # Segment a file

Data

We ask MD students to construct coarse and fine level word segmentation on EHRs for validation. We do not use data for training!

  • dev.txt: Unlabeled EHRs from part of CCKS2019.
  • dev_label_coarse.txt: Coarse-level word segmentation labels.
  • dev_label_fine.txt: Fine-level word segmentation labels.

Citation

If you find our codes or data useful, please cite:

@article{YUAN2020103542,
title = "Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition",
journal = "Journal of Biomedical Informatics",
volume = "110",
pages = "103542",
year = "2020",
issn = "1532-0464",
doi = "https://doi.org/10.1016/j.jbi.2020.103542",
url = "http://www.sciencedirect.com/science/article/pii/S1532046420301702",
author = "Zheng Yuan and Yuanhao Liu and Qiuyang Yin and Boyao Li and Xiaobin Feng and Guoming Zhang and Sheng Yu",
}