HanziGraph


996.icu LICENSE

Visualization of information about Chinese characters via Neo4j, plus text augmentation based on Chinese characters and words (typos, synonyms, antonyms, similar entities, numerals, etc.).

Introduction:

  • I integrate several open-source corpora of Chinese characters and words to build a visualized graph, named HanziGraph, motivated by the need for character-level similarity comparison, NLP data augmentation, and curiosity (。・ω・。)ノ

  • Furthermore, a lightweight Chinese text augmentation tool is provided, built on several clean word-level corpora that I integrated from open-source datasets such as The Contemporary Chinese Dictionary and BigCilin.


File Dependency:

-> corpus -> char_number: stroke-count data for Chinese characters
         |-> char_part: radical data
         |-> char_pronunciation: pinyin data
         |-> char_similar: structure classes, four-corner codes, visually similar and phonetically similar characters
         |-> char_split: character decomposition and simplified/traditional mappings
         |-> basic_dictionary_similar.json
         |-> basic_triple.xlsx
         |-> corpus_handian -> word_handian: Contemporary Chinese Dictionary (现代汉语词典) data and the generated JSON
                           |-> get_handian.py  # script that processes the Handian data
                           |-> combine_n.py  # script that merges the Handian and Cilin noun data
         |-> corpus_dacilin -> word_dacilin: BigCilin (大词林) data and the generated JSON
                           |-> get_dacilin.py  # script that processes the Cilin data
  |-> prepro.py  # script that preprocesses the character datasets
  |-> build_graph.py  # script that builds the character graph
  |-> text_augmentation.py  # script that augments text using the dictionaries

Dataset

entry (i.e. each character) = {
    "split_to (decomposition schemes)": ["part (radical) atom (component) ...", ...],
    "has_atom (which components it contains)": [atom (component), ...],
    "is_atom_of (which characters it is a component of)": [char (character), ...],
    "has_part (which radicals it contains)": [part (radical), ...],
    "is_part_of (which characters it is a radical of)": [char (character), ...],
    "pronunciation (its pronunciations)": [pronunciation (pinyin with tone), ...],
    "number (stroke count)": number (stroke count),
    "is_simple_to (simplified form of which traditional characters)": [char (character), ...],
    "is_traditional_to (traditional form of which simplified characters)": [char (character), ...],
    "similar_to (similar characters)": [char (character), ...]
    }
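
For a quick look at the integrated data, the following is a minimal sketch. It assumes basic_dictionary_similar.json (listed in the file tree above) stores one such entry per character, keyed by the character itself; the path and exact field layout are assumptions, so adjust them to the actual output of prepro.py:

```python
# Minimal sketch: inspect one character's entry in the integrated dictionary.
# Assumption: corpus/basic_dictionary_similar.json maps each character to an
# entry with the fields documented above; adapt the path/keys to the real output.
import json

with open("corpus/basic_dictionary_similar.json", encoding="utf-8") as f:
    dictionary = json.load(f)

entry = dictionary.get("好", {})
print("decompositions:", entry.get("split_to", []))
print("components:", entry.get("has_atom", []))
print("radicals:", entry.get("has_part", []))
print("pronunciations:", entry.get("pronunciation", []))
print("stroke count:", entry.get("number"))
print("similar characters:", entry.get("similar_to", []))
```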

Command Line:

  • prepro: this script integrates the corpora into one large dictionary and then converts that dictionary into triple-based data:
python prepro.py
  • build_graph: this script converts the triple-based data into entity/relation-based data (the node and relationship CSVs imported below); a hedged sketch of that conversion follows the command:
python build_graph.py
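
The exact schema is defined in build_graph.py; the sketch below only illustrates, with assumed file names and column names, how (head, relation, tail) triples can be turned into the node and relationship CSVs that the Neo4j 3.5 import tool expects (`:ID`/`:LABEL` columns for nodes, `:START_ID`/`:END_ID`/`:TYPE` for relationships):

```python
# Illustrative sketch only, not the repo's actual schema: convert
# (head, relation, tail) triples into the CSV layout accepted by neo4j-import.
import pandas as pd

# Assumed input: corpus/basic_triple.xlsx with columns head, relation, tail.
triples = pd.read_excel("corpus/basic_triple.xlsx")

# Node file: one row per distinct entity, with :ID and :LABEL columns.
entities = pd.unique(triples[["head", "tail"]].values.ravel())
nodes = pd.DataFrame({"hanziId:ID": entities, "name": entities, ":LABEL": "Hanzi"})
nodes.to_csv("hanzi_entity.csv", index=False)

# Relationship file: start/end node ids plus a :TYPE column naming the relation.
relations = pd.DataFrame({
    ":START_ID": triples["head"],
    ":END_ID": triples["tail"],
    ":TYPE": triples["relation"],
})
relations.to_csv("hanzi_relation.csv", index=False)
```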
  • import data into Neo4j: run the following commands to import the entity/relation-based data into Neo4j (a small verification sketch follows them):
./neo4j-import --into /your_path/neo4j-community-3.5.5/data/databases/graph.db/ --nodes /your_path/hanzi_entity.csv --relationships /your_path/hanzi_relation.csv --ignore-duplicate-nodes=true --ignore-missing-nodes=true
./neo4j console
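
Once the database is up, a quick sanity check can be done from Python with the official neo4j driver. This is only a sketch assuming a local instance on the default bolt port; the credentials and property names are placeholders to be replaced with whatever your setup and build_graph.py actually use:

```python
# Sketch: sanity-check the imported graph with the official neo4j Python driver.
# Bolt URL, credentials, and property names are assumptions for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

with driver.session() as session:
    # Count the imported nodes.
    total = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    print("nodes imported:", total)

    # List a character's direct neighbours and the relation types.
    result = session.run(
        "MATCH (c {name: $name})-[r]->(m) RETURN type(r) AS rel, m.name AS target LIMIT 10",
        name="好",
    )
    for record in result:
        print(record["rel"], "->", record["target"])

driver.close()
```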
  • generate dictionaries for text augmentation: use the functions in get_handian.py and get_dacilin.py, together with create_corpus4typos in prepro.py, to generate the dictionaries used for text augmentation.

  • text augmentation: use the functions in this script to generate new samples; run it to see examples (a hedged sketch of the general approach follows the command). The detailed design method can be found here.

python text_augmentation.py
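
The actual entry points are the functions in text_augmentation.py; since their names are not listed here, the sketch below only illustrates the general idea of dictionary-driven augmentation (randomly swapping words for synonyms from a JSON dictionary) with placeholder file names, not the repo's API:

```python
# Illustrative sketch of dictionary-driven augmentation, not the repo's API.
# Assumption: a JSON file maps a word to a list of synonyms (e.g. built from
# the Handian/BigCilin dictionaries); the file name is a placeholder.
import json
import random

def augment(tokens, synonyms, n_samples=3, p=0.3):
    """Return variants of the token list with some words swapped for synonyms."""
    samples = []
    for _ in range(n_samples):
        new_tokens = [
            random.choice(synonyms[t]) if t in synonyms and random.random() < p else t
            for t in tokens
        ]
        samples.append("".join(new_tokens))
    return samples

if __name__ == "__main__":
    with open("synonym_dict.json", encoding="utf-8") as f:  # placeholder path
        synonym_dict = json.load(f)
    # In the real pipeline segmentation would come from LTP; here it is given by hand.
    print(augment(["今天", "天气", "很", "好"], synonym_dict))
```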

Requirements

  • Python = 3.6.9
  • Neo4j = 3.5.5
  • pypinyin = 0.41.0
  • pandas = 0.22.0
  • fuzzywuzzy = 0.17.0
  • LTP 4 = 4.0.9
  • tqdm = 4.39.0

References


Cite this work as:


Use Case