The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
The pretrained model is available at 🤗 Hugging Face Model Hub: https://huggingface.co/microsoft/unihanlm-base
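A minimal loading sketch with 🤗 Transformers (assuming the hub repository ships the usual config and tokenizer files; follow the model card if it specifies a particular model class):

```python
from transformers import AutoTokenizer, AutoModel

# Load the released UnihanLM checkpoint from the Hugging Face Model Hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unihanlm-base")
model = AutoModel.from_pretrained("microsoft/unihanlm-base")
```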
We have made the code to find the Unihan clusters and the cached cluster IDs available here.
Please follow our paper and use the training code from facebookresearch/XLM.
- Preprocess your corpus by replacing every character with the first character of its Unihan cluster (a minimal sketch is given after this list).
- After cluster-level pretraining, copy the embedding of the first character in each cluster to the other characters in the same cluster:
import torch
# Copy the first (representative) character's embedding to the rest of its cluster.
with torch.no_grad():
    for cluster in clusters:
        for character_id in cluster[1:]:
            embedding.weight[character_id] = embedding.weight[cluster[0]]
- Re-preprocess the corpus with the standard procedure, this time keeping the original characters (no cluster replacement).
- Restart training on the re-preprocessed corpus.
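As a reference for the first step, here is a minimal sketch of the cluster-level (coarse) corpus preprocessing. It assumes `clusters` is the list of character clusters loaded from the cached cluster IDs in this repository; the helper names `build_char_map` and `coarsen` are illustrative, not part of the released code:

```python
def build_char_map(clusters):
    """Map every character in a cluster to the cluster's first (representative) character."""
    char_map = {}
    for cluster in clusters:
        representative = cluster[0]
        for ch in cluster:
            char_map[ch] = representative
    return char_map

def coarsen(text, char_map):
    """Replace each character with its cluster representative; leave other characters untouched."""
    return "".join(char_map.get(ch, ch) for ch in text)

# Illustrative example: if 国, 國 and 囯 fall into one cluster whose first
# character is 国, then coarsen("中國", char_map) -> "中国".
```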
Note: XLM is released under a CC BY-NC 4.0 licence.